Preface


This book contains five invited expository articles resulting from the workshop 
“Large-Scale Inverse Problems and Applications in the Earth Sciences” which took 
place from October 24th to October 28th, 2011, at the Johann Radon Institute for 
Computational and Applied Mathematics (RICAM), Austrian Academy of Sciences at 
the Johannes Kepler University in Linz, Austria. This workshop was part of a special 
semester at the RICAM devoted to “Multiscale Simulation and Analysis in Energy and 
the Environment” which took place from October 3rd to December 16th, 2011. The 
special semester was designed around four workshops with the ambition to invoke 
interdisciplinary cooperation between engineers, hydrologists, meteorologists, and 
mathematicians. 

The workshop on which this collection of articles is based was devoted more 
specifically to establishing ties between specialists engaged in research involving 
real-world applications, e.g. in meteorology, hydrology and geosciences, and experts 
in the theoretical background such as statisticians and mathematicians working on 
Bayesian inference, inverse problem and control theory. 

The two central problems discussed at the workshop were the processing and 
handling of large scale data and models in earth sciences, and the efficient extraction 
of the relevant information from them. For instance, weather forecasting models in- 
volve hundreds of millions of degrees of freedom and the available data easily exceed 
millions of measurements per day. Since it is of no practical use to predict tomor- 
row’s weather from today’s data by a process that takes a couple of days, the need 
for efficient and fast methods to manage large amounts of data is obvious. The sec- 
ond crucial aspect is the extraction of information (in a broad sense) from these data. 
Since this information is often “hidden” or perhaps only accessible by indirect mea- 
surements, it takes special mathematical methods to distill and process it. A general 
mathematical methodology that is useful in this situation is that of inverse problems 
and regularization and, closely related, that of Bayesian inference. These two paths 
of information extraction can very roughly be distinguished by the fact that in the 
former, the information is usually considered a deterministic quantity, while in the 
latter, it is treated as a stochastic one.

A loose arrangement of the articles in this book follows this structuring of infor- 
mation extraction paradigms; all in view of large scale data and real-world applica- 
tions: 

• Aspects of inverse problems, regularization and data assimilation. The article by
Freitag and Potthast provides a general theoretical framework for data assimilation,
a special type of inverse problem, and puts the theory of inverse problems in context,
providing similarities and differences between general inverse problems and data as- 
similation problems. Lawless discusses state-of-the-art methodologies for data assim- 
ilation as a state estimation problem in current real-world applications, with partic- 
ular emphasis on meteorology. In both cases, the need to treat spatial and temporal 
correlations effectively makes the application somewhat different from many other
applications of inverse problems. 

• Aspects of inverse problems and Bayesian inference. The survey paper by Reich
and Cotter gives an introduction to mathematical tools for data assimilation coming 
from Bayesian inference. In particular, ensemble filter techniques and Monte Carlo 
methods are discussed. In this case, the need to incorporate spatial and temporal 
correlations makes cost-effective implementation very challenging. 

• Aspects of inverse problems and regularization in imaging applications. The article
by Burger, Dirks and Müller is an overview of the acquisition, processing, and
interpretation of data and the associated mathematical models in imaging sciences.
While this article highlights the benefits of the nowadays very popular nonlinear
($\ell_1$-based) regularizations, the article by van den Doel, Ascher and Haber complements
the picture by contrasting these benefits with the drawbacks of $\ell_1$-based approaches
and by attempting to somewhat restore the “lost honor” of the more traditional and
effective, linear $\ell_2$-type regularizations.


The review-type articles in this book contain basic material as well as many interest- 
ing aspects of inverse problems, regularization and data assimilation, with the provi- 
sion of excellent and extensive references to the current literature. Hence, it should be 
of interest to both graduate students and researchers, and a valuable reference point 
for both practitioners and theoretical scientists. 

We would like to thank the authors of these articles for their commendable con- 
tributions to this book. Without their time and commitment, the production of this 
book would not have been possible. We would also like to thank Nathan Smith (Uni- 
versity of Bath) and Peter Jan van Leeuwen (University of Reading) who helped review 
the articles. Additionally, we would like to express our gratitude to the speakers and 
participants of the workshop, who contributed to a successful workshop in Linz.

Moreover, we would like to thank Prof. Heinz Engl, founder and former director 
of RICAM, and Prof. Ulrich Langer, former director of RICAM, for their hospitality and
for giving us the opportunity to organize this workshop at the RICAM. In addition, 
we would like to acknowledge the work of the administrative and computer support 
team at RICAM, Susanne Dujardin, Annette Weihs, Wolfgang Forsthuber and Florian 
Tischler, as well as the local scientific organizers Jörg Willems, Johannes Kraus and 
Erwin Karer. The special semester, the workshops and this book would not have been 
possible without their efforts. 

More information on the special semester and the four workshops can be found at 
http://www.ricam.oeaw.ac.at/specsem/specsem2011/. 


Exeter Mike Cullen 
Bath Melina A. Freitag 
Linz Stefan Kindermann 
Melina A. Freitag and Roland W.E. Potthast
Synergy of inverse problems and data 
assimilation techniques 


Abstract: This review article aims to provide a theoretical framework for data
assimilation, a specific type of inverse problem arising, for example, in numerical weather
prediction, hydrology and geology.

We consider the general mathematical theory for inverse problems and regular- 
ization, before we treat Tikhonov regularization, as one of the most popular meth- 
ods for solving inverse problems. We show that data assimilation techniques such 
as three-dimensional and four-dimensional variational data assimilation (3DVar and 
4DVar) as well as the Kalman filter and Bayes’ data assimilation are, in the linear case, 
a form of cycled Tikhonov regularization. We give an introduction to key data assimi- 
lation methods as currently used in practice, link them and show their similarities. We 
also give an overview of ensemble methods. Furthermore, we provide an error analysis 
for the data assimilation process in general, show research problems and give numer- 
ical examples for simple data assimilation problems. An extensive list of references is 
given for further reading. 


Keywords: Inverse problems, ill-posedness, regularization theory, Tikhonov regular- 
ization, error analysis, 3DVar, 4DVar, Bayesian perspective, Kalman filter, Kalman 
smoother, ensemble methods, advection diffusion equation, Lorenz-95 system 
1 Introduction


Inverse problems appear in many applications and have received a great deal of atten- 
tion from applied mathematicians, engineers and statisticians. They occur, for exam- 
ple, in geophysics, medical imaging (such as ultrasound, computerized tomography 
and electrical impedance tomography), computer vision, machine learning, statisti- 
cal inference, geology, hydrology, atmospheric dynamics and many other important 
areas of physics and industrial mathematics. 

This article aims to provide a theoretical framework for data assimilation, a spe- 
cific inverse problem arising, for example, in numerical weather prediction (NWP) 
and hydrology [48, 57, 58, 70, 83]. A few introductory articles on data assimilation 
in the atmospheric and ocean sciences are available, mainly from the engineering 
and meteorological point of view, for example, [20, 44, 48, 51, 63, 66, 71]. However,
a comprehensive mathematical analysis in light of the theory of inverse problems
is missing. This expository article aims to provide one.

An inverse problem is a problem which is posed in a way that is inverse to most
direct problems. The so-called direct problem we have in mind is that of determining
the effect $f$ from given causes and conditions $\varphi$ when a definite physical or
mathematical model $H$ in the form of a relation

$$ H(\varphi) = f \tag{1.1} $$

is given. In general, the operator $H$ is nonlinear and describes the governing
equations that relate the model parameters to the observed data. Hence, in an inverse
problem, we are looking for $\varphi$, that is, a special cause, state, parameter or
condition of a mathematical model. The solution of an inverse problem can be described
as the construction of $\varphi$ from data $f$ (see, for example, [22, 49]). We now consider
the specific inverse problem arising in data assimilation which usually also contains
a dynamic aspect.

Data assimilation is, loosely speaking, a method for combining observations of 
the state of a complex system with predictions from a computer model output of that 
same state where both the observations and the model output data contain errors 
and (in case of the observations) are often incomplete. The task in data assimilation 
(and hence the inverse problem) is seeking the best state estimate with the available 
information about the physical model and observations. 

Let $X$ be the state space. For the remainder of this article, we generally assume
that $X$ (and also $Y$) are Hilbert spaces unless otherwise stated. Let $\varphi \in X$, where
$\varphi$ is the state (of the atmosphere, for example), that is, a vector containing all state
variables. Furthermore, let $\varphi_k \in X$ be the state at time $t_k$ and $M_k : X \to X$ the
(generally nonlinear) model operator at time $t_k$ which describes the evolution of the
states from time $t_k$ to time $t_{k+1}$, that is, $\varphi_{k+1} = M_k(\varphi_k)$. For the moment, we
consider a perfect model, that is, the true system dynamics are assumed to be known.
We also use the notation

$$ M_{k,\ell} = M_{k-1} M_{k-2} \cdots M_{\ell+1} M_{\ell}, \quad k > \ell \in \mathbb{N}_0, \tag{1.2} $$

to describe the evolution of the system dynamics from time $t_\ell$ to time $t_k$.

Let $Y_k$ be the observation space at time $t_k$ and $f_k \in Y_k$ be the observation vector,
collecting all the observations at time $t_k$. Finally, let $H_k : X \to Y_k$ be the (generally
nonlinear) observation operator at time $t_k$, mapping variables in the state space to
variables in the observation space. The data assimilation problem can then be defined
as follows.


Definition 1.1 (Data assimilation problem). Given observations $f_k \in Y_k$ at time $t_k$,
determine the states $\varphi_k \in X$ from the operator equations

$$ H_k(\varphi_k) = f_k, \quad k = 0, 1, 2, \ldots \tag{1.3} $$

subject to the model dynamics $M_k : X \to X$ given by $\varphi_{k+1} = M_k(\varphi_k)$, where
$k = 0, 1, 2, \ldots$.
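
To make the ingredients of Definition 1.1 concrete, the following minimal sketch generates a trajectory of states and a matching sequence of noisy observations. Everything in it is an illustrative assumption rather than part of the article: the model operator is a planar rotation (`rotation_model`), the observation operator returns only the first state component (`observe_first`), and the noise level is arbitrary.

```python
import numpy as np

def rotation_model(theta):
    """Hypothetical linear model operator M_k: rotate a 2D state by the angle theta."""
    c, s = np.cos(theta), np.sin(theta)
    M = np.array([[c, -s], [s, c]])
    return lambda phi: M @ phi

def observe_first(phi):
    """Hypothetical observation operator H_k: only the first component is observed."""
    return phi[:1]

def simulate(model, obs_op, phi0, n_steps, noise_std, rng):
    """Generate states phi_0..phi_K and noisy observations f_0..f_K."""
    states = [np.asarray(phi0, dtype=float)]
    for _ in range(n_steps):
        states.append(model(states[-1]))          # phi_{k+1} = M_k(phi_k)
    states = np.array(states)
    clean = np.array([obs_op(phi) for phi in states])
    observations = clean + noise_std * rng.standard_normal(clean.shape)
    return states, observations

rng = np.random.default_rng(0)
states, observations = simulate(rotation_model(0.1), observe_first,
                                [1.0, 0.0], n_steps=20, noise_std=0.05, rng=rng)
print(states.shape, observations.shape)   # (21, 2) (21, 1)
```

The data assimilation task is then to recover `states` from `observations` and the knowledge of the two operators; the observation space here has a lower dimension than the state space, so a single observation does not determine the state.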


In numerical weather prediction, the operator $M_k$ involves the solution of a
time-dependent nonlinear partial differential equation. Usually, the observation
operator $H_k$ is dynamic, that is, it changes at every time step. However, for simplicity, we
often let $H_k := H$. Both the operator $H_k$ and the data $f_k$ contain errors. Also, in
practice, the dynamical model $M_k$ involves errors, that is, $M_k$ does not represent the true
system dynamics because of model errors. For a detailed account of the errors occurring
in the data assimilation problem, we refer to Section 4. Moreover, the model dynamics
represented by the nonlinear operators $M_k$ are usually chaotic. In the context of data
assimilation, additional information might be given through known prior information
(background information) about the state variable denoted by $\varphi^{(b)} \in X$.

The operator equation (1.3) (see also (1.1)) is usually ill-posed, that is, at least
one of the following well-posedness conditions according to Hadamard [33] is not
satisfied.


Definition 1.2 (Well-Posedness [49, 82]). Let $X$, $Y$ be normed spaces and $H : X \to Y$
be a nonlinear mapping. Then, the operator equation $H(\varphi) = f$ from (1.1) is called
well-posed if the following holds:

• Existence: For every $f \in Y$, there exists at least one $\varphi \in X$ such that $H(\varphi) = f$,
that is, the operator $H$ is surjective.

• Uniqueness: The solution $\varphi$ from $H(\varphi) = f$ is unique, that is, the operator $H$ is
injective.

• Stability: The solution $\varphi$ depends continuously on the data $f$, that is, it is stable
with respect to perturbations in $f$.


Equation (1.1) is ill-posed if it is not well-posed.
Note that for a general nonlinear operator $H$, neither the existence nor the uniqueness
of a solution of the operator equation need be satisfied. If the existence condition in
Definition 1.2 is not satisfied, then it is possible that $f \in R(H)$. However, for a perturbed
right-hand side $f^\delta$, we have $f^\delta \notin R(H)$, where $R(H) = \{f \in Y : f = H(\varphi),\ \varphi \in X\}$
is the range of $H$. Existence of a generalized solution can sometimes (for instance,
in the finite-dimensional case) be ensured by solving the minimization problem

$$ \min_{\varphi \in X} \|f - H(\varphi)\|_Y^2, \tag{1.4} $$

which is equivalent to (1.1) if $f \in R(H)$. The norm $\|\cdot\|_Y$ is a generic norm in $Y$. The
second condition in Definition 1.2 implies that an inverse operator $H^{-1} : R(H) \subseteq Y \to X$
with $H^{-1}(f) = \varphi$ exists. If the uniqueness condition is not satisfied, then it is
possible to ensure uniqueness by looking for special solutions, for example, solutions
that are closest to a reference element $\varphi^* \in X$, or solutions with a minimum norm.
Hence, at least in the linear case, uniqueness can be ensured if

$$ \|f - H\varphi_{\mathrm{uni}}\|_Y^2 = \min_{\varphi \in X} \|f - H\varphi\|_Y^2, \tag{1.5} $$

where $\|\varphi_{\mathrm{uni}} - \varphi^*\|_X = \min\{\|\varphi - \varphi^*\|_X : \varphi \in X,\ \varphi \text{ is a minimizer in (1.5)}\}$. The
third condition in Definition 1.2 implies that the inverse operator $H^{-1} : R(H) \subseteq Y \to X$
is continuous. Usually, this problem is the most severe one, as small perturbations
in the right-hand side $f \in Y$ lead to large errors in the solution $\varphi \in X$ and the
problem needs to be regularized. We will look at this aspect in Section 2.

From the above discussion, it follows that the operator equation (1.3) is well-posed
if the operator $H_k$ is bijective and has a well-defined inverse operator $H_k^{-1}$
which is continuous. A least squares solution can be found by solving the
minimization problem

$$ \min_{\varphi_k \in X} \|f_k - H_k(\varphi_k)\|_Y^2, \quad k = 0, 1, 2, \ldots \tag{1.6} $$

We can solve (1.6) at every time step $k$, which is a sequential data assimilation
problem. If we include the nonlinear model dynamics constraint $M_k : X \to X$ given by
$\varphi_{k+1} = M_k(\varphi_k)$, over the time steps $t_k$, $k = 0, \ldots, K$, and take the sum of the least
squares problems in every time step, the minimization problem becomes

$$ \min_{\varphi_k \in X} \sum_{k=0}^{K} \|f_k - H_k(\varphi_k)\|_Y^2 = \min_{\varphi_0 \in X} \sum_{k=0}^{K} \|f_k - H_k M_{k,0}(\varphi_0)\|_Y^2, \tag{1.7} $$

where $M_{k,0}$ denotes the evolution of the model operator from time $t_0$ to time $t_k$, that
is, $M_{k,0} = M_{k-1} M_{k-2} \cdots M_0$, using the system dynamics (1.2), and $M_{k,k} = I$. Both
the sequential data assimilation system (1.6) and the data assimilation system (1.7)
can be written in the form

$$ \min_{\varphi \in X} \|f - H(\varphi)\|_Y^2 \tag{1.8} $$

with an appropriate operator $H$. Problem (1.8) is equivalent to $H(\varphi) = f$ (cf. (1.1)) if
$f \in R(H)$. For the sequential assimilation system (1.6), we have $H := H_k$, $f := f_k$
and $\varphi := \varphi_k$ at every step $k = 0, 1, \ldots$. For the system (1.7), we have $\varphi := \varphi_0$,

$$ H := \begin{pmatrix} H_0 \\ H_1 M_{1,0} \\ \vdots \\ H_K M_{K,0} \end{pmatrix}, \qquad f := \begin{pmatrix} f_0 \\ f_1 \\ \vdots \\ f_K \end{pmatrix}. $$

In general, $H$ is a nonlinear operator since both the model dynamics $M_k$ and the
observation operators $H_k$ are nonlinear. If the equation $H(\varphi) = f$ is well-posed, then
$H$ has a well-defined continuous inverse operator $H^{-1}$ and $R(H) = Y$.

Now, if $H$ is a linear operator in Banach spaces, then well-posedness follows from
the first two conditions in Definition 1.2, which are equivalent to $R(H) = Y$ and
$N(H) = \{0\}$, where $N(H)$ is the null space of $H$. Moreover, if $H$ is a linear operator
on a finite-dimensional Hilbert space (in particular, if $R(H)$ is of finite dimension),
then the stability condition in Definition 1.2 holds automatically and well-posedness
follows from either one of the first two conditions in Definition 1.2. (The last condition
in Definition 1.2 follows from the compactness of the unit ball in finite dimensions [49].)
For linear $H$, the uniqueness condition $N(H) = \{0\}$ is clearly satisfied if the
observability matrix $H$ has full column rank. In this case, the system is observable, that is,
it is possible to determine the behavior of the entire system from the system's output, see
[47, 73].

The remaining question is the stability of the (injective) operator equation
$H(\varphi) = f$ (or $H\varphi = H(\varphi) = f$, a notation which we are going to use from now on)
for a compact linear operator $H : X \to Y$ in infinite dimensions. As an operator equation
with a compact linear operator is always ill-posed in an infinite-dimensional space
(as $R(H)$ is not closed), we need some form of regularization.

Note that the discretization of an infinite-dimensional unstable ill-posed problem
naturally leads to a finite-dimensional problem which is well-posed in the sense of
Definition 1.2. However, the discrete problem will be ill-conditioned, that is,
an error in the input data will still lead to large errors in the solution. Hence, some
form of regularization is also needed for finite-dimensional problems arising from
infinite-dimensional ill-posed operators.
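
The growth of the condition number under refinement can be observed directly. The sketch below is only an illustration under an assumption of ours: as compact operator we take the integration operator $(H\varphi)(t) = \int_0^t \varphi(s)\,ds$ on $[0, 1]$, discretized by a simple rectangle rule (the helper `integration_matrix` is hypothetical and not taken from the article). Each discrete matrix is invertible, but its condition number grows as the grid is refined.

```python
import numpy as np

def integration_matrix(n):
    """Rectangle-rule discretization of (H phi)(t) = int_0^t phi(s) ds on [0, 1]."""
    h = 1.0 / n
    return h * np.tril(np.ones((n, n)))

sizes = [10, 100, 1000]
conds = [np.linalg.cond(integration_matrix(n)) for n in sizes]
for n, c in zip(sizes, conds):
    print(f"n = {n:5d}   cond(H_n) = {c:10.1f}")
```

For this mildly ill-posed example, the growth is moderate; operators with rapidly decaying singular values lead to much faster growth.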

In the following, we consider compact linear operators H for which a singular 
value decomposition exists (see, for example, [49]). 


Lemma 1.3 (Singular system of compact linear operators). Let $H : X \to Y$ be a
compact linear operator. Then, there exist sets of indices $J = \{1, \ldots, m\}$ for $\dim(R(H)) = m$
and $J = \mathbb{N}$ for $\dim(R(H)) = \infty$, orthonormal systems $\{u_j\}_{j \in J}$ in $X$ and $\{v_j\}_{j \in J}$
in $Y$ and a sequence $\{\sigma_j\}_{j \in J}$ of positive real numbers with the following properties:

$$ \{\sigma_j\}_{j \in J} \text{ is non-increasing and } \lim_{j \to \infty} \sigma_j = 0 \text{ for } J = \mathbb{N}, \tag{1.9} $$

$$ H u_j = \sigma_j v_j \ (j \in J) \quad \text{and} \quad H^* v_j = \sigma_j u_j \ (j \in J). \tag{1.10} $$

For all $\varphi \in X$, there exists an element $\varphi_0 \in N(H)$ with

$$ \varphi = \varphi_0 + \sum_{j \in J} \langle \varphi, u_j \rangle u_j \quad \text{and} \quad H\varphi = \sum_{j \in J} \sigma_j \langle \varphi, u_j \rangle v_j. \tag{1.11} $$

Furthermore,

$$ H^* f = \sum_{j \in J} \sigma_j \langle f, v_j \rangle u_j \tag{1.12} $$

holds for all $f \in Y$. The countable set of triples $\{\sigma_j, u_j, v_j\}_{j \in J}$ is called a singular
system, $\{\sigma_j\}_{j \in J}$ are called singular values, $\{u_j\}_{j \in J}$ are right singular vectors and form
an orthonormal basis for $N(H)^\perp$ and $\{v_j\}_{j \in J}$ are left singular vectors and form an
orthonormal basis for $\overline{R(H)}$.
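
In finite dimensions, the singular system of Lemma 1.3 is the usual singular value decomposition of a matrix, and its properties can be checked numerically. The following sketch uses a random matrix as a stand-in for $H$ (an assumption made purely for illustration) and NumPy's `numpy.linalg.svd`. Note that NumPy returns the factors in the order left vectors, singular values, transposed right vectors, so the variable names below are chosen to match the lemma: the $u_j$ live in $X$ and the $v_j$ in $Y$.

```python
import numpy as np

rng = np.random.default_rng(1)
H = rng.standard_normal((4, 6))            # H maps X = R^6 into Y = R^4

# NumPy returns H = V diag(sigma) U^T; in the notation of Lemma 1.3 the
# columns of U are the u_j (in X) and the columns of V are the v_j (in Y).
V, sigma, Ut = np.linalg.svd(H, full_matrices=False)
U = Ut.T

phi = rng.standard_normal(6)
f = rng.standard_normal(4)
phi0 = phi - U @ (U.T @ phi)               # component of phi in the null space N(H)

checks = {
    "(1.9)  sigma non-increasing": bool(np.all(np.diff(sigma) <= 0)),
    "(1.10) H u_j = sigma_j v_j": np.allclose(H @ U, V * sigma),
    "(1.10) H* v_j = sigma_j u_j": np.allclose(H.T @ V, U * sigma),
    "(1.11) H phi_0 = 0": np.allclose(H @ phi0, 0.0),
    "(1.11) expansion of H phi": np.allclose(H @ phi, V @ (sigma * (U.T @ phi))),
    "(1.12) expansion of H* f": np.allclose(H.T @ f, U @ (sigma * (V.T @ f))),
}
for name, ok in checks.items():
    print(f"{name}: {ok}")
```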

In the following, we mostly consider compact linear operators, although the
concept of ill-posedness can be extended to nonlinear operators [23, 40, 49, 82] by
considering linearizations of the nonlinear problem using, for example, the Fréchet
derivative of the nonlinear operator. One can show that for compact nonlinear operators,
the Fréchet derivative is compact as well, leading to the concept of locally ill-posed 
problems for nonlinear operator equations. For solving nonlinear problems compu- 
tationally, usually some form of linearization is required. Hence, most of our results 
for linear problems can be extended to the case of iterative solutions to nonlinear 
problems (where a linear problem needs to be solved at each iteration). 


2 Regularization theory 


Problems of the form $H\varphi = f$ with a compact operator $H$ are ill-posed in infinite
dimensions since the inverse of $H$ is not uniformly bounded. Hence, in order to solve
$H\varphi = f$ (or, for $f \in R(H)$, its equivalent minimization problem $\min \|H\varphi - f\|^2$),
regularization is needed.

Let $H : X \to Y$ and denote its adjoint operator by $H^* : Y \to X$. Furthermore, let
$\varphi$ be the unique solution to the least squares minimization problem $\min \|H\varphi - f\|^2$.
Then, the solution to the minimization problem is equivalent to the solution of the
normal equations

$$ H^* H \varphi = H^* f. \tag{1.13} $$

Clearly, if $H : X \to Y$ is compact, then $H^*H$ is compact and the normal equations
(1.13) remain ill-posed. However, if we replace (1.13) by

$$ (\alpha I + H^*H)\varphi_\alpha = \alpha\varphi_\alpha + H^*H\varphi_\alpha = H^* f \tag{1.14} $$

with $\alpha > 0$, then the operator $(\alpha I + H^*H)$ has a bounded inverse. The equation (1.14)
is typically referred to as Tikhonov regularization and $\alpha$ is a regularization parameter.
We have the following theorem (see, for example, [17, 40, 62, 78, 82]).


Theorem 1.4 (Tikhonov regularization). Let $H : X \to Y$ be a compact linear operator.
Then, the operator $(\alpha I + H^*H)$ has a bounded inverse and the problem (1.14) is
well-posed for $\alpha > 0$ and $\varphi_\alpha = (\alpha I + H^*H)^{-1} H^* f$ is the Tikhonov approximation of
a minimum-norm least squares solution $\varphi$ of (1.13). Furthermore, the solution $\varphi_\alpha$ is
equivalent to the unique solution of the minimization problem

$$ \min_{\varphi \in X} T_\alpha(\varphi) := \min_{\varphi \in X} \left\{ \|f - H\varphi\|_Y^2 + \alpha \|\varphi\|_X^2 \right\}, \tag{1.15} $$

where $T_\alpha(\varphi)$ is the so-called Tikhonov functional.
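
The equivalence stated in Theorem 1.4 can be checked numerically in finite dimensions with Euclidean norms, where $H^* = H^T$. The sketch below is an illustration with a random matrix and arbitrary $\alpha$ (both assumptions of ours); the helper names `tikhonov` and `tikhonov_functional` are hypothetical. It solves the regularized normal equations (1.14) and compares the result with the minimizer of (1.15), computed independently as an ordinary least squares problem for the stacked system $[H; \sqrt{\alpha} I]\varphi \approx [f; 0]$.

```python
import numpy as np

def tikhonov(H, f, alpha):
    """Solve (alpha I + H^T H) phi = H^T f, cf. (1.14)."""
    n = H.shape[1]
    return np.linalg.solve(alpha * np.eye(n) + H.T @ H, H.T @ f)

def tikhonov_functional(H, f, alpha, phi):
    """Tikhonov functional (1.15) with Euclidean norms."""
    return np.sum((f - H @ phi) ** 2) + alpha * np.sum(phi ** 2)

rng = np.random.default_rng(2)
H = rng.standard_normal((5, 8))            # underdetermined: N(H) is nontrivial
f = rng.standard_normal(5)
alpha = 0.1

phi_alpha = tikhonov(H, f, alpha)

# Same minimizer via stacked least squares: || [H; sqrt(alpha) I] phi - [f; 0] ||^2
A = np.vstack([H, np.sqrt(alpha) * np.eye(8)])
b = np.concatenate([f, np.zeros(8)])
phi_lstsq = np.linalg.lstsq(A, b, rcond=None)[0]

# Any small perturbation of phi_alpha should increase the functional.
T_min = tikhonov_functional(H, f, alpha, phi_alpha)
T_perturbed = [tikhonov_functional(H, f, alpha,
                                   phi_alpha + 1e-3 * rng.standard_normal(8))
               for _ in range(5)]
print(np.allclose(phi_alpha, phi_lstsq), all(T > T_min for T in T_perturbed))
```

Although $H$ has a nontrivial null space here, the regularized problem has a unique solution for every $\alpha > 0$.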


In general, Tikhonov regularization can be used with a known reference element
$\varphi^{(b)}$, that is, the term $\|\varphi\|_X^2$ in (1.15) is replaced by $\|\varphi - \varphi^{(b)}\|_X^2$, and the problem is
often referred to as generalized Tikhonov regularization. We consider this problem in
Section 3.

We have the following definition for a general linear regularization scheme. 


Definition 1.5 (Regularization scheme). A family of bounded linear operators
$\{R_\alpha\}_{\alpha > 0}$, $R_\alpha : Y \to X$, is a linear regularization scheme for the compact bounded
linear injective operator $H$ if

$$ \lim_{\alpha \to 0} R_\alpha H \varphi = \varphi \quad \forall \varphi \in X. \tag{1.16} $$

Clearly, the family of approximate inverses $R_\alpha = (\alpha I + H^*H)^{-1} H^* : Y \to X$ is
a linear regularization scheme for $H$. If the range of $H$, $R(H)$, is not closed, then

$$ \lim_{\alpha \to 0} \|R_\alpha\| = \infty. \tag{1.17} $$

If we apply the regularization operator $R_\alpha$ to noisy data $f^\delta$ with noise level $\delta$, that is,
$\|f^\delta - f\|_Y \le \delta$, we get regularized solutions

$$ \varphi_\alpha^\delta = R_\alpha f^\delta. $$


Using the singular system of a compact operator from Lemma 1.3, we may also write
the regularized solution arising from Tikhonov regularization via the minimization
problem in (1.15) as

$$ \varphi_\alpha^\delta = \sum_{j \in J} \frac{\sigma_j}{\alpha + \sigma_j^2} \langle f^\delta, v_j \rangle u_j. \tag{1.18} $$

We observe that for $\alpha = 0$, the solution $\varphi_\alpha^\delta$ amplifies the noise in $f^\delta$ since for
compact operators $\lim_{j \to \infty} \sigma_j = 0$.
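
The spectral representation (1.18) can be compared with the direct solution of (1.14) in a small finite-dimensional example. The sketch below is illustrative only: the matrix with rapidly decaying singular values, the data vector and the value of $\alpha$ are assumptions of ours, and the function names are hypothetical. It also evaluates the factors $\sigma_j / (\alpha + \sigma_j^2)$, which are bounded by $1/(2\sqrt{\alpha})$ for $\alpha > 0$, whereas the unregularized factors $1/\sigma_j$ grow without bound as $\sigma_j \to 0$.

```python
import numpy as np

def tikhonov_direct(H, f_delta, alpha):
    """Solve (alpha I + H^T H) phi = H^T f_delta, cf. (1.14)."""
    n = H.shape[1]
    return np.linalg.solve(alpha * np.eye(n) + H.T @ H, H.T @ f_delta)

def tikhonov_spectral(H, f_delta, alpha):
    """Evaluate the singular value expansion (1.18)."""
    V, sigma, Ut = np.linalg.svd(H, full_matrices=False)   # H = V diag(sigma) U^T
    filter_factors = sigma / (alpha + sigma**2)
    return Ut.T @ (filter_factors * (V.T @ f_delta))

rng = np.random.default_rng(5)
# Illustrative matrix whose singular values decay over several orders of magnitude.
H = rng.standard_normal((6, 6)) @ np.diag(10.0 ** -np.arange(6))
f_delta = rng.standard_normal(6)
alpha = 1e-4

phi_direct = tikhonov_direct(H, f_delta, alpha)
phi_spectral = tikhonov_spectral(H, f_delta, alpha)

sigma = np.linalg.svd(H, compute_uv=False)
max_gain = np.max(sigma / (alpha + sigma**2))   # largest regularized amplification
max_gain_unregularized = np.max(1.0 / sigma)    # amplification for alpha = 0
print(np.allclose(phi_direct, phi_spectral), max_gain, max_gain_unregularized)
```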

Furthermore, for the exact unique solution, we have $\varphi = H^\dagger f$, where
$H^\dagger : R(H) \oplus R(H)^\perp \to X$ denotes the Moore-Penrose pseudoinverse of $H$ [82], and it
is continuous if $R(H)$ is closed. Therefore, we may estimate the total regularization
error

$$ \|\varphi_\alpha^\delta - \varphi\|_X \le \|R_\alpha\| \delta + \|R_\alpha f - H^\dagger f\|_X, $$

or, for $N(H) = \{0\}$,

$$ \|\varphi_\alpha^\delta - \varphi\|_X \le \|R_\alpha\| \delta + \|R_\alpha H \varphi - \varphi\|_X. \tag{1.19} $$


Hence, the total regularization error consists of a stability component $\|R_\alpha\| \delta$ which
represents the influence of the data error $\delta$ and a component $\|R_\alpha H \varphi - \varphi\|_X$ which
represents the approximation error of the regularization scheme. For small $\alpha$, the
second component will be small (see (1.16)), but the first component will be large (see (1.17)).
However, for large values of $\alpha$, the first term will be small and the second one large. We
will see this in the examples in Section 9. Hence, finding a good value for the
regularization parameter $\alpha$ is important. Techniques for regularization parameter
estimation aim to find a reasonably good value for $\alpha$ (see, for example, [37, 38, 82]). The
most prominent ones are the L-curve method, generalized cross-validation and the
discrepancy principle.
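
The trade-off between the two error components can be made visible with a small numerical experiment. The sketch below is an illustration under assumptions that are not taken from the article: the operator is a rectangle-rule discretization of the integration operator on $[0, 1]$, the exact solution is $\sin(2\pi t)$, and the data carry Gaussian noise of standard deviation $10^{-2}$. The reconstruction error is evaluated on a logarithmic grid of regularization parameters; no parameter choice rule is implemented, the exact solution is simply used to measure the error.

```python
import numpy as np

n = 100
h = 1.0 / n
t = (np.arange(n) + 0.5) * h
H = h * np.tril(np.ones((n, n)))           # discretized integration operator
phi_true = np.sin(2 * np.pi * t)           # illustrative exact solution

rng = np.random.default_rng(0)
f_delta = H @ phi_true + 1e-2 * rng.standard_normal(n)   # noisy data

alphas = 10.0 ** np.arange(-12, 3)         # 1e-12, 1e-11, ..., 1e2
errors = []
for alpha in alphas:
    phi_alpha = np.linalg.solve(alpha * np.eye(n) + H.T @ H, H.T @ f_delta)
    errors.append(np.linalg.norm(phi_alpha - phi_true))
errors = np.array(errors)

best = int(np.argmin(errors))
for alpha, err in zip(alphas, errors):
    print(f"alpha = {alpha:8.1e}   error = {err:8.3f}")
print("smallest error at alpha =", alphas[best])
```

In this example, the error is dominated by amplified noise for very small $\alpha$ and by the approximation error for very large $\alpha$, with the smallest error attained in between.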

A regularization scheme is called convergent if from the convergence of the data
error to zero, it follows that the regularized solution converges to the exact solution.
One can show that a regularization scheme $R_\alpha = (\alpha I + H^*H)^{-1} H^* : Y \to X$ arising
in Tikhonov regularization is a convergent regularization if $\alpha(\delta) \to 0$ and $\delta^2 / \alpha(\delta) \to 0$
as $\delta \to 0$ [22]. For Tikhonov regularization, one may choose $\alpha = O(\delta)$ such that this
holds [82].

Other regularization schemes for inverse problems are also possible, some of the
most famous ones being the truncated singular value decomposition (TSVD) and the
Landweber iteration (see, for example, [22, 34, 35]). Moreover, it is possible to change
the penalty term $\|\varphi\|_X^2$ in (1.15). Other penalty functionals can be used to incorporate
a priori information about the solution $\varphi$. Prominent methods are total variation
regularization or the use of sparsity promoting norms (like the $L^1$-norm, for example) in
the penalty functional. There is a fast growing literature on this topic, see, for
example, [1, 7, 13, 82, 86] and the articles by Burger et al. [10] and van den Doel et al. [81]
in this book.

In the following, we use the results from inverse problems and regularization the- 
ory to develop a coherent mathematical framework for several data assimilation tech- 
niques used in practice. 


3 Cycling, Tikhonov regularization and 3DVar 


Data assimilation aims to solve a dynamic inverse problem which includes
measurement data $f_1, f_2, f_3, \ldots, f_k, \ldots$ at various times $t_1 < t_2 < t_3 < \cdots < t_k < \cdots$. At
every time $t_k$, the inversion problem is given by (1.3). However, usually the data $f_k$ do
not contain enough information to recover the state $\varphi_k$ at time $t_k$ completely. Thus,
it is crucial to take the dynamical evolution of the states into account.

Assume that we are given some reconstruction $\varphi_k^{(a)}$ at time $t_k$ for some $k \in \mathbb{N}$.
Then, we expect that

$$ \varphi_{k+1}^{(b)} = M_k(\varphi_k^{(a)}) \tag{1.20} $$

is a reasonable first guess for the system state at time $t_{k+1}$, where $M_k$ describes the
model dynamics and is given in Definition 1.1. In data assimilation, $\varphi_{k+1}^{(b)}$ is called the
background or first guess. At time $t_{k+1}$, we would like to assimilate the data $f_{k+1}$ to
calculate a reconstruction $\varphi_{k+1}^{(a)}$ which is also called the analysis in data assimilation.
Then, the background $\varphi_{k+2}^{(b)}$ at time $t_{k+2}$ can be calculated using (1.20) with $k$
replaced by $k + 1$ and another reconstruction can be carried out at time $t_{k+2}$. This
approach is called cycling of reconstruction and dynamics.


Definition 1.6 (Cycling for data assimilation). Start with some initial state $\varphi_0^{(a)}$ at
time $t_0$. For $k = 0, 1, 2, \ldots$, carry out the cycling steps:

(i) Propagation Step. Use the system dynamics $M_k$ to calculate a background $\varphi_{k+1}^{(b)}$ at
time $t_{k+1}$ using (1.20).

(ii) Analysis Step. With the data $f_{k+1}$ at time $t_{k+1}$ (and the knowledge of the
background $\varphi_{k+1}^{(b)}$), calculate a reconstruction or analysis $\varphi_{k+1}^{(a)}$.

Increase the index $k$ to $k + 1$ and go to Step (i).


A key characteristic of a data assimilation system is its Analysis Step (ii). Here,
for any step $k$, the task is to calculate a reconstruction $\varphi_k^{(a)}$ using the data $f_k$ and the
knowledge of the background $\varphi_k^{(b)}$. We need to choose or develop a reconstruction
method which optimally combines the given information.

To carry out the analysis, we will study two basic approaches, one coming from 
optimization and optimal control theory, the other arising from stochastics and prob- 
ability theory. In this section, we focus on the optimization approach and Section 5 
will provide an introduction to the stochastic approach using Bayes’ formula. The re- 
lationship between the two approaches will be discussed in detail in Section 5. 

With a norm $\|\cdot\|_X$ in the state space $X$ and a norm $\|\cdot\|_Y$ in the data (or
observation) space $Y$, we can combine the given information at step $k$, namely, the
observation data $f_k \in Y$ and the background $\varphi_k^{(b)} \in X$, by minimizing the inhomogeneous
Tikhonov functional

$$ J_k(\varphi) := \alpha \|\varphi - \varphi_k^{(b)}\|_X^2 + \|f_k - H\varphi\|_Y^2 \tag{1.21} $$

at time $t_k$. Here, $H : X \to Y$ is the observation operator defined in Section 1. With
$\tilde{\varphi}_k := \varphi - \varphi_k^{(b)}$, this is transformed into the Tikhonov functional (1.15) in the form

$$ J_k(\varphi) = \alpha \|\tilde{\varphi}_k\|_X^2 + \|(f_k - H\varphi_k^{(b)}) - H\tilde{\varphi}_k\|_Y^2. \tag{1.22} $$

According to Theorem 1.4, it is minimized by

$$ \tilde{\varphi}_k^{(a)} := (\alpha I + H^*H)^{-1} H^* (f_k - H\varphi_k^{(b)}), \tag{1.23} $$

leading to the minimizer

$$ \varphi_k^{(a)} = \varphi_k^{(b)} + (\alpha I + H^*H)^{-1} H^* (f_k - H\varphi_k^{(b)}) \tag{1.24} $$

of the functional (1.21). We denote the cycling of Definition 1.6 with an analysis
calculated by (1.24) as cycled Tikhonov regularization.
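
Cycled Tikhonov regularization can be sketched in a few lines for a linear toy problem. The following code is a minimal illustration, assuming Euclidean norms (so that $H^* = H^T$), a time-independent linear model given by a rotation matrix `M`, an observation operator `H` that sees only the first state component, and small Gaussian observation noise. All of these choices, as well as the function names, are assumptions made for the example and are not taken from the article.

```python
import numpy as np

def tikhonov_analysis(phi_b, f, H, alpha):
    """Analysis step (1.24), taking the Euclidean adjoint H* = H^T."""
    n = H.shape[1]
    increment = np.linalg.solve(alpha * np.eye(n) + H.T @ H, H.T @ (f - H @ phi_b))
    return phi_b + increment

def cycled_tikhonov(phi_a0, M, H, data, alpha):
    """Cycling of Definition 1.6 for a linear, time-independent model matrix M."""
    phi_a, analyses = phi_a0, []
    for f_next in data:                                       # f_1, f_2, ...
        phi_b = M @ phi_a                                     # (i) propagation (1.20)
        phi_a = tikhonov_analysis(phi_b, f_next, H, alpha)    # (ii) analysis (1.24)
        analyses.append(phi_a)
    return np.array(analyses)

# Toy setup (all values are illustrative assumptions).
theta = 0.5
M = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
H = np.array([[1.0, 0.0]])              # only the first component is observed
rng = np.random.default_rng(3)

truth = [np.array([1.0, 0.0])]
for _ in range(30):
    truth.append(M @ truth[-1])
truth = np.array(truth)
data = truth[1:] @ H.T + 0.01 * rng.standard_normal((30, 1))

phi_a0 = np.array([0.0, 1.0])           # deliberately wrong initial state
analyses = cycled_tikhonov(phi_a0, M, H, data, alpha=0.5)

initial_error = np.linalg.norm(phi_a0 - truth[0])
final_error = np.linalg.norm(analyses[-1] - truth[-1])
print(initial_error, final_error)
```

Each single analysis corrects only the observed component, yet in this example the model dynamics transport the correction to the unobserved component, so that the analysis error decreases over the cycles.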

Often, data assimilation works in spaces $X = \mathbb{R}^n$ and $Y = \mathbb{R}^m$ of dimensions
$n \in \mathbb{N}$ and $m \in \mathbb{N}$. The norms in the spaces $X$ and $Y$ are given explicitly using
the standard $L^2$-norms and some weighting matrices $B \in \mathbb{R}^{n \times n}$ and $R \in \mathbb{R}^{m \times m}$. In
Section 5, these matrices will be chosen to coincide with the error covariance matrices
of the state distributions in $X$ and the error covariance matrices of the observation
distributions in $Y$. For the moment, we assume the matrices to be symmetric, positive
definite and invertible. Then, we define a weighted scalar product in $X = \mathbb{R}^n$ by

$$ \langle \varphi, \psi \rangle_{B^{-1}} := \varphi^T B^{-1} \psi, \quad \varphi, \psi \in X = \mathbb{R}^n, \tag{1.25} $$

and a weighted scalar product in $Y = \mathbb{R}^m$ by

$$ \langle f, g \rangle_{R^{-1}} := f^T R^{-1} g, \quad f, g \in Y = \mathbb{R}^m. \tag{1.26} $$

With the corresponding norms $\|\cdot\|_{B^{-1}}$ in $X$ and $\|\cdot\|_{R^{-1}}$ in $Y$, we can rewrite the
functional (1.21) into the form

$$ J_k(\varphi) = \alpha (\varphi - \varphi_k^{(b)})^T B^{-1} (\varphi - \varphi_k^{(b)}) + (f_k - H\varphi)^T R^{-1} (f_k - H\varphi). \tag{1.27} $$

In the framework of the cycling given by Definition 1.6, this functional is known as
the three-dimensional variational data assimilation scheme (3DVar), see, for example,
[20, 51]. Often, the notation $x$ and $x^{(b)}$ for the state and the background, as well
as $y$ for the observations, is used in the meteorological literature of data assimilation.
Here, by building a bridge to the functional analytic framework, we will use
$\varphi \in X$ for the states and $f \in Y$ for the observations. Also, $x$ and $y$ will denote points
in the physical space $\mathbb{R}^3$. This is also advantageous when we employ ensemble
methods and analyze localization techniques.

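The functional (1.27) can easily be transformed into the general Tikhonov
regularization form. By $H'$, we denote the adjoint operator of $H$ with respect to the
standard $L^2$ scalar products $\langle \cdot, \cdot \rangle$ in $X = \mathbb{R}^n$ and $Y = \mathbb{R}^m$. The notation $H^*$ is used for the
adjoint operator with respect to the weighted scalar products $\langle \cdot, \cdot \rangle_{B^{-1}}$ and $\langle \cdot, \cdot \rangle_{R^{-1}}$.
Then, for $\varphi \in Y$ and $\psi \in X$, we calculate

$$
\begin{aligned}
\langle \varphi, H\psi \rangle_{R^{-1}} &= \varphi^T R^{-1} H \psi \\
&= \langle H' R^{-1} \varphi, \psi \rangle \\
&= \langle H' R^{-1} \varphi, B B^{-1} \psi \rangle \\
&= \langle B H' R^{-1} \varphi, B^{-1} \psi \rangle \\
&= \langle B H' R^{-1} \varphi, \psi \rangle_{B^{-1}} \\
&= \langle H^* \varphi, \psi \rangle_{B^{-1}},
\end{aligned}
\tag{1.28}
$$

leading to

$$ H^* = B H' R^{-1}. $$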
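
The adjoint relation behind $H^* = BH'R^{-1}$ is easy to verify numerically. In the sketch below, the matrices `B` and `R` are random symmetric positive definite matrices and `H` is a random matrix; these are stand-ins chosen for illustration, and the helpers `random_spd` and `weighted_inner` are hypothetical. The code evaluates both sides of $\langle g, H\psi \rangle_{R^{-1}} = \langle H^* g, \psi \rangle_{B^{-1}}$ for an element `g` of $Y$ and an element `psi` of $X$.

```python
import numpy as np

def random_spd(k, rng):
    """Random symmetric positive definite matrix (illustrative stand-in for B or R)."""
    A = rng.standard_normal((k, k))
    return A @ A.T + k * np.eye(k)

def weighted_inner(a, b, W):
    """Weighted scalar product <a, b>_{W^{-1}} = a^T W^{-1} b, cf. (1.25) and (1.26)."""
    return a @ np.linalg.solve(W, b)

rng = np.random.default_rng(4)
n, m = 5, 3
B, R = random_spd(n, rng), random_spd(m, rng)
H = rng.standard_normal((m, n))
H_star = B @ H.T @ np.linalg.inv(R)          # H* = B H' R^{-1}

g = rng.standard_normal(m)                   # element of Y = R^m
psi = rng.standard_normal(n)                 # element of X = R^n
lhs = weighted_inner(g, H @ psi, R)          # <g, H psi>_{R^{-1}}
rhs = weighted_inner(H_star @ g, psi, B)     # <H* g, psi>_{B^{-1}}
print(lhs, rhs)
```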


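This means that the minimizer (1.24) of (1.21) with the norms based on the scalar
products (1.25) and (1.26) is given by

$$
\begin{aligned}
\varphi_k^{(a)} &= \varphi_k^{(b)} + (\alpha I + H^*H)^{-1} H^* (f_k - H\varphi_k^{(b)}) \\
&= \varphi_k^{(b)} + (\alpha I + BH'R^{-1}H)^{-1} BH'R^{-1} (f_k - H\varphi_k^{(b)}).
\end{aligned}
\tag{1.29}
$$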


The operator $\alpha I + H^*H$ maps the state space $X$ into itself. In large scale data
assimilation problems, the dimension $n$ of the state space is often much larger than the
dimension $m$ of the data space $Y$. In this case, the inversion of $\alpha I + H^*H$ is not
feasible, and it is advantageous to derive a different form of the update formula known as
measurement space inversion. Using the invertibility of the operators $\alpha I + H^*H$ in $X$
and $\alpha I + HH^*$ in $Y$, we start from

$$ (\alpha I + H^*H) H^* = H^* (\alpha I + HH^*). $$

We multiply by the inverse $(\alpha I + H^*H)^{-1}$ from the left and by $(\alpha I + HH^*)^{-1}$ from
the right to obtain

$$ H^* (\alpha I + HH^*)^{-1} = (\alpha I + H^*H)^{-1} H^*. \tag{1.30} $$

With the help of (1.30), we transform (1.29) into

$$
\begin{aligned}
\varphi_k^{(a)} &= \varphi_k^{(b)} + H^* (\alpha I + HH^*)^{-1} (f_k - H\varphi_k^{(b)}) \\
&= \varphi_k^{(b)} + BH'R^{-1} (\alpha I + HBH'R^{-1})^{-1} (f_k - H\varphi_k^{(b)}) \\
&= \varphi_k^{(b)} + BH' (\alpha R + HBH')^{-1} (f_k - H\varphi_k^{(b)}).
\end{aligned}
\tag{1.31}
$$


Here, the inversion of $(\alpha I + HH^*)$ or $(\alpha R + HBH')$, respectively, takes place in the
space $Y = \mathbb{R}^m$. The solution is then projected into the state space by the application
of $BH'$. In the meteorological literature of data assimilation, the solution (1.29) is often
referred to as the solution arising from Optimal Interpolation (OI) [29, 68]. It refers
to a direct method being used to solve the 3DVar minimization problem (1.27) rather
than an iterative optimization technique. In the linear case, Optimal Interpolation
and 3DVar are equivalent. Method (1.31) is often called the PSAS (physical space
statistical analysis) scheme in the literature on meteorology and oceanography [16, 18].
We summarize our results in the following theorem.


Theorem 1.7 (Equivalence of cycled Tikhonov regularization and 3DVar). 3DVar or
three-dimensional variational data assimilation (1.29) or (1.31) is equivalent to cycled
Tikhonov regularization (1.24) when the norms arise from the weighted inner products
(1.25) and (1.26).


Theorem 1.7 shows that 3DVar is merely a cycled Tikhonov regularization in an ap- 
propriately chosen norm. 
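
The equivalence of the state space form (1.29) and the measurement space form (1.31) can be checked numerically. The following is a minimal sketch, assuming small random test matrices; the function names, the helper `random_spd` and all dimensions are illustrative choices and not part of the original text.

```python
import numpy as np

def analysis_state_space(phi_b, f, H, B, R, alpha):
    """Update (1.29): solves an n x n system in the state space X."""
    n = B.shape[0]
    H_star = B @ H.T @ np.linalg.inv(R)  # weighted adjoint H* = B H' R^{-1}
    return phi_b + np.linalg.solve(alpha * np.eye(n) + H_star @ H,
                                   H_star @ (f - H @ phi_b))

def analysis_measurement_space(phi_b, f, H, B, R, alpha):
    """Update (1.31): solves an m x m system in the data space Y."""
    return phi_b + B @ H.T @ np.linalg.solve(alpha * R + H @ B @ H.T,
                                             f - H @ phi_b)

def random_spd(k, rng):
    """Hypothetical helper: a random symmetric positive definite matrix."""
    A = rng.standard_normal((k, k))
    return A @ A.T + k * np.eye(k)

rng = np.random.default_rng(0)
n, m = 50, 5                       # state dimension much larger than data dimension
H = rng.standard_normal((m, n))
B, R = random_spd(n, rng), random_spd(m, rng)
phi_b, f = rng.standard_normal(n), rng.standard_normal(m)

a1 = analysis_state_space(phi_b, f, H, B, R, alpha=0.5)
a2 = analysis_measurement_space(phi_b, f, H, B, R, alpha=0.5)
print(np.max(np.abs(a1 - a2)))     # difference at the level of rounding error
```

Both functions return the same analysis up to rounding; the second one only ever solves a system of size $m$, which is the practical point of measurement space inversion when $n \gg m$.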


4 Error analysis 


In this part, we investigate the error arising in data assimilation, that is, we consider 
the error between the true solution and the solution obtained from a data assimila- 
tion scheme. The solution obtained from solving a data assimilation problem is often 
referred to as analysis in the data assimilation literature. As a generic method, we 
will study cycled Tikhonov regularization, which, according to Theorem 1.7, includes 
three-dimensional variational assimilation. We will later see that this also carries over
to cycled four-dimensional variational data assimilation, which we will discuss in
Section 6.

We need to take into account errors which can arise when we cycle the update formula
(1.24) according to Definition 1.6. Assume that $\varphi_k^{(\mathrm{true})}$ is the true
state at time $t_k$, $k = 0, 1, 2, \dots$, and $f_k^{(\mathrm{true})}$ are the true values
of the data. The errors we need to take into account include
(1) Measurement error: Errors in the data $f_k$, that is, we measure $f_k^{(\delta)}$
with a data error $d_k^{(\delta)} := f_k^{(\delta)} - f_k^{(\mathrm{true})}$ of size
$\|d_k^{(\delta)}\| \le \delta$. This error was discussed in Section 2 and arises through
errors in the measurements and noisy data.

(2) Observation operator error: Errors in the measurement operator $H$, that is, we use
a measurement operator $H$ which is different from the true mapping
$H^{(\mathrm{true})}$ of the state $\varphi$ to the data $f$.

(3) Reconstruction/approximation error: Reconstruction errors by using the inverse
$R_\alpha = (\alpha I + H^* H)^{-1} H^*$ as an approximation to the inverse $H^{-1}$ of
$H$. This error was discussed in Section 2.

(4) Model error: The model operator which we defined in Section 1 is usually only
an approximation $M_k$ to the true system dynamics $M_k^{(\mathrm{true})}$. Model error
arises as the dynamical model does not usually describe the system behavior exactly. It
incorporates numerical error arising from discretization of the partial differential
equations that need to be solved and includes inaccuracies in the physical parameters and
forcing terms as well as in the model itself, which is usually merely a simplification of
reality.

(5) Accumulated errors: There will be accumulated errors in the background in the
sense that the analysis error from the previous step leads to an error in the background
of the next step in contrast to the background which would be arising from the true state
$\varphi_k^{(\mathrm{true})}$.


In every analysis step of the assimilation, we obtain an error contribution by the
measurement error, by the error in the observation operator $H$ and by the regularization
operator $R_\alpha$ approximating the inversion of $H$. For the propagation step, we obtain
an error caused by the model $M_k$ approximating the true dynamics
$M_k^{(\mathrm{true})}$. Moreover, the errors may accumulate over time.


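Theorem 1.8. The evolution of the analysis error
$e_k := \varphi_k^{(a)} - \varphi_k^{(\mathrm{true})}$ for cycled Tikhonov regularization
and three-dimensional variational data assimilation is given by

\[
\begin{aligned}
e_{k+1} ={}& \underbrace{(I - R_\alpha H)}_{\text{reconstruction error}}
  \underbrace{\Big\{ M_k e_k + \big(M_k - M_k^{(\mathrm{true})}\big) \varphi_k^{(\mathrm{true})} \Big\}}_{\text{propagation of previous error and model error}} \\
 &+ \underbrace{R_\alpha d_{k+1}^{(\delta)}}_{\text{data error influence}}
  + \underbrace{R_\alpha \big(H^{(\mathrm{true})} - H\big) \varphi_{k+1}^{(\mathrm{true})}}_{\text{observation operator error}}.
\end{aligned}
\tag{1.32}
\]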


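Proof. We know from Theorem 1.7 that 3DVar and Tikhonov regularization are equivalent.
We use the update formula (1.24) and the Tikhonov regularization operator
$R_\alpha = (\alpha I + H^* H)^{-1} H^*$. With (1.20), as well as

\[
\varphi_{k+1}^{(\mathrm{true})} = M_k^{(\mathrm{true})} \varphi_k^{(\mathrm{true})}
\quad \text{and} \quad
f_{k+1}^{(\mathrm{true})} = H^{(\mathrm{true})} \varphi_{k+1}^{(\mathrm{true})},
\]

and subtracting $\varphi_{k+1}^{(\mathrm{true})}$ from $\varphi_{k+1}^{(a)}$, we calculate

\[
\begin{aligned}
e_{k+1} :={}& \varphi_{k+1}^{(a)} - \varphi_{k+1}^{(\mathrm{true})} \\
 ={}& \varphi_{k+1}^{(b)} - \varphi_{k+1}^{(\mathrm{true})}
      + R_\alpha \big(f_{k+1} - H \varphi_{k+1}^{(b)}\big) \\
 ={}& M_k \varphi_k^{(a)} - M_k^{(\mathrm{true})} \varphi_k^{(\mathrm{true})}
      + R_\alpha \big(f_{k+1} - f_{k+1}^{(\mathrm{true})}\big)
      + R_\alpha \big(H^{(\mathrm{true})} \varphi_{k+1}^{(\mathrm{true})} - H \varphi_{k+1}^{(b)}\big)
      && (1.33) \\
 ={}& M_k \big(\varphi_k^{(a)} - \varphi_k^{(\mathrm{true})}\big)
      + \big(M_k - M_k^{(\mathrm{true})}\big) \varphi_k^{(\mathrm{true})}
      + R_\alpha d_{k+1}^{(\delta)} \\
    & + R_\alpha \big(H^{(\mathrm{true})} \varphi_{k+1}^{(\mathrm{true})} - H \varphi_{k+1}^{(\mathrm{true})}
      + H \varphi_{k+1}^{(\mathrm{true})} - H \varphi_{k+1}^{(b)}\big).
      && (1.34)
\end{aligned}
\]

We treat the last term in (1.33) similarly to the first term in (1.34). Since
$\varphi_{k+1}^{(b)} - \varphi_{k+1}^{(\mathrm{true})} = M_k e_k + (M_k - M_k^{(\mathrm{true})}) \varphi_k^{(\mathrm{true})}$,
the contribution $R_\alpha H (\varphi_{k+1}^{(\mathrm{true})} - \varphi_{k+1}^{(b)})$ combines
with the first two terms into the factor $(I - R_\alpha H)$. Then, collecting all parts,
we derive (1.32).

If the model error and the error in the observation operator in Theorem 1.8 are
excluded, we obtain

\[
e_{k+1} = R_\alpha d_{k+1}^{(\delta)} + (I - R_\alpha H) M_k e_k,
\]

and, taking norms and using $\|d_{k+1}^{(\delta)}\| \le \delta$, this is precisely the
regularization error arising in Tikhonov regularization (1.19). If we select an appropriate
value for $\alpha$, this error can be made very small.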

However, in many (practical) cases, the errors arising from the model and the 
observation operator are much bigger than the regularization error. Model error, in 
particular, can be very large due to insufficient resolution and inaccuracies in the 
physical model dynamics. This is specifically the case for a chaotic behavior of the 
system. The model error is a very important part of the total error and a very active 
area of current research (see, for example, [14, 27, 52, 80, 87]). 

We also notice that even if there is no model error, no observation operator error and no
data error, we have $e_{k+1} = (I - R_\alpha H) M_k e_k$, and the errors can accumulate if
$\alpha$ is chosen too large, in particular, if $\|(I - R_\alpha H) M_k\| > 1$ (see also
[60, 67]). Note that for any regularization scheme, condition (1.16) holds and therefore
$\alpha$ needs to be chosen small enough.
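
The dependence of the error propagator $(I - R_\alpha H) M_k$ on $\alpha$ can be illustrated with a toy example. This is a minimal sketch under simplifying assumptions: the standard Euclidean adjoint is used instead of the weighted one, and the operators $H = I$ and $M = 2I$ (an expanding dynamics) are hypothetical choices made only so that the norm can be verified by hand as $2\alpha/(1+\alpha)$.

```python
import numpy as np

def tikhonov_operator(H, alpha):
    """R_alpha = (alpha I + H^T H)^{-1} H^T with the Euclidean adjoint (assumption)."""
    n = H.shape[1]
    return np.linalg.solve(alpha * np.eye(n) + H.T @ H, H.T)

def error_propagator_norm(H, M, alpha):
    """Spectral norm of (I - R_alpha H) M, the factor acting on e_k in the recursion."""
    R_alpha = tikhonov_operator(H, alpha)
    return np.linalg.norm((np.eye(M.shape[0]) - R_alpha @ H) @ M, 2)

H = np.eye(2)          # hypothetical observation operator: full state observed
M = 2.0 * np.eye(2)    # hypothetical expanding model dynamics

norms = {alpha: error_propagator_norm(H, M, alpha) for alpha in (0.1, 1.0, 10.0)}
for alpha, value in norms.items():
    print(f"alpha = {alpha:5.1f}   ||(I - R_alpha H) M|| = {value:.4f}")
```

For this toy setup the norm is below one for $\alpha = 0.1$, equal to one for $\alpha = 1$ and above one for $\alpha = 10$, so the previous analysis error is damped only when $\alpha$ is small enough.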

We have shown that within cycled data assimilation schemes, various forms of 
errors occur and influence each other which is important to consider when applying 
data assimilation methods in practice. 

We will see in Section 6 that cycled four-dimensional variational data assimila- 
tion can be covered by the same framework of error analysis since cycled 4DVar is 
a form of cycled nonlinear Tikhonov regularization. 

In the remainder of this article, we assume that no model error is present, that is,
the model operator $M_k$ represents the perfect model dynamics.


5 Bayesian approach to inverse problems 


Probability theory provides a wide set of tools which can be used to solve inverse 
problems. In particular, the Bayesian theory has become quite popular as a generic 
approach which can be applied to inverse and ill-posed problems as well (see, for 
example, [5, 12, 75, 85]). 

Bayesian theory has the potential to provide a stochastic background for many 
ideas which might appear ad hoc in the area of deterministic inverse problems and 
functional analysis. Also, Bayesian theory provides much more than just a solution 
to the inverse or data assimilation problem, but a full-grown theory to calculate esti- 
mates for the uncertainty as well. 

However, we will see that all algorithms which can be formulated on a Bayesian
background have their deterministic counterpart and, alternatively, can be studied
purely within the framework of functional analysis and optimization. In this section,
we apply Bayesian ideas to the observation and background errors.
Let us consider the equation

\[
H(\varphi) = f, \tag{1.35}
\]

as introduced in (1.1) as a starting point, where in this section we assume that
$X = \mathbb{R}^n$ and $Y = \mathbb{R}^m$, $m, n \in \mathbb{N}$. The more general case
with probability measures on infinite-dimensional spaces can be treated formally in a
similar way, but involves some nontrivial technicalities.

In the stochastic framework, the task of inverting equation (1.35) given some
measurement $f$ does not ask for one special solution. Since $f$ is just one draw from
some random distribution $\pi_Y$, any particular solution is of limited value and
significance; instead, we want to know the conditional probability distribution of
$\varphi$ given some information about the error distribution of $f$. This conditional
distribution can then be used either to calculate an expectation value for $\varphi$ given
$f$ or to evaluate the uncertainty of this estimate measured, for example, by its variance.

We need to formulate our setup in more detail and with well-defined spaces and
operators. Stochastic theory assumes that the quantity $\varphi$ is a random variable on
some probability space $(\Omega, \Sigma, P)$ with values in $X$. Here, $\Sigma$ denotes
some $\sigma$-algebra and $P$ is a probability measure which maps any subset
$A \subset \Omega$ for which $A \in \Sigma$ into a number $P(A) \in [0,1]$, the
probability of the set $A$. We then obtain a probability $P_X$ of the values of $\varphi$
to be in some set $C \subset X$ by

\[
P_X(\varphi \in C) := P(\{\omega : \varphi(\omega) \in C\}). \tag{1.36}
\]


We also assume that the measurement $f$ is a random variable with some probability
distribution $P_Y$ on $Y$. This probability distribution will depend on the true value
$f^{(\mathrm{true})}$ and is our model for measurement error during the process of
measuring $f$. Here, we assume that the probability distribution (1.36) on $X$ has a
probability density $\pi_X : X \to [0, \infty)$ such that

\[
P_X(C) = \int_C \pi_X(\varphi) \, d\varphi \tag{1.37}
\]

for every open subset $C \subset X$. In the same way, we assume that $P_Y$ has a
probability density $\pi_Y$ on $Y$ such that

\[
P_Y(U) = \int_U \pi_Y(f) \, df
\]

for every open subset $U \subset Y$. Usually, for simplicity, we drop the letters $X$ and $Y$.

Clearly, since the conditional probability of some event $C \subset X$ given some event
$\tilde{C} \subset X$ is defined by
$P(C | \tilde{C}) := P(C \cap \tilde{C}) / P(\tilde{C})$, we have that the conditional
probability of event $C$ given $U$ is

\[
P(C | U) = \frac{P(\{\omega : \varphi(\omega) \in C \text{ and } f(\omega) \in U\})}{P(\{\omega : f(\omega) \in U\})},
\]

where $P(\{\omega : f(\omega) \in U\}) > 0$. In terms of the probability density functions
(PDFs), the conditional probability is formulated by

\[
\pi(\varphi | f) = \frac{\pi(\varphi, f)}{\pi(f)}, \tag{1.38}
\]

where $\pi(\varphi, f)$ is the joint probability density of $\varphi$ and $f$ living on the
space $X \times Y$ and $\pi(f) \neq 0$ is the probability density of $f$ in $Y$. Equation
(1.38) also holds with the role of $\varphi$ and $f$ exchanged, i.e. we have

\[
\pi(f | \varphi) = \frac{\pi(\varphi, f)}{\pi(\varphi)}, \tag{1.39}
\]
assuming that $\pi(\varphi) \neq 0$. Now, from equations (1.38) and (1.39), we get the
famous Bayes' formula for conditional probability densities, that is,

\[
\pi(\varphi | f) = \frac{\pi(\varphi) \, \pi(f | \varphi)}{\pi(f)}. \tag{1.40}
\]

Note that the value of $\pi(f)$ can be obtained from the knowledge that the integral of
$\pi(\varphi | f)$ over the whole space $X$ should be equal to one, i.e. it is not
necessary to know $\pi(f)$ (it is merely a normalizing constant).

Bayes' formula now provides a "simple" solution to the stochastic inverse problem of
inverting equation (1.35). Given a probability density $\pi(\varphi)$ on $X$ and some
error density $\pi$ on $Y$ which can be used to calculate the density of the data
distribution (often called the "measurement model" in statistics),

\[
\pi(f | \varphi) = \pi(f - H(\varphi)), \tag{1.41}
\]

we employ (1.40) to calculate the conditional probability density function
$\pi(\varphi | f)$. This probability density is also known as posterior density or
analysis density function. It is the density of the unobservable $\varphi \in X$ given the
data $f \in Y$, whereas the measurement model $\pi(f | \varphi)$ gives the probability
density of observing the data $f$ as a function of $\varphi$ (the likelihood). The density
function $\pi(\varphi)$ on $X$ is denoted as prior density. The posterior density is
considered as the solution to the inverse problem.
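
As an illustration of (1.40) and (1.41), the posterior density can be evaluated on a grid in one dimension, with the normalizing constant $\pi(f)$ obtained from the requirement that the posterior integrates to one. This is a minimal sketch, assuming a scalar linear observation operator $H(\varphi) = h\varphi$ and Gaussian prior and error densities; the numbers `h`, `f`, the variances and the grid are illustrative assumptions.

```python
import numpy as np

def gaussian(x, mean, var):
    """Density of a one-dimensional normal distribution."""
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

def grid_posterior(grid, prior, likelihood):
    """Bayes' formula on a uniform grid; pi(f) is recovered by normalization."""
    dx = grid[1] - grid[0]
    unnormalized = prior * likelihood
    return unnormalized / (unnormalized.sum() * dx)

grid = np.linspace(-10.0, 10.0, 4001)   # discretized state space X = R
h, f = 2.0, 3.0                         # hypothetical observation operator and datum
prior_var, error_var = 1.0, 0.5         # hypothetical variances

prior = gaussian(grid, 0.0, prior_var)                 # pi(phi)
likelihood = gaussian(f - h * grid, 0.0, error_var)    # pi(f | phi) = pi(f - H(phi))
posterior = grid_posterior(grid, prior, likelihood)    # pi(phi | f)

dx = grid[1] - grid[0]
post_mean = np.sum(grid * posterior) * dx
post_var = np.sum((grid - post_mean) ** 2 * posterior) * dx
print(post_mean, post_var)
```

For these Gaussian inputs the grid result can be compared with the closed-form values: the gain is $h \cdot 1 / (0.5 + h^2 \cdot 1) = 4/9$, giving a posterior mean of $4/3$ and a posterior variance of $1/9$.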


Remark 1.9. Note that Bayes' formula seems to provide a very easy and stable approach to
solving the inverse problem. The calculation of the posterior density $\pi(\varphi | f)$
is obtained by a multiplication of two given distributions $\pi(\varphi)$ and
$\pi(f - H(\varphi))$. However, the calculation of the mean of the posterior distribution
involves the solution of an ill-posed equation. In general, the full ill-posedness of the
task is implicitly involved in Bayes' data assimilation as it is in all other schemes as
well.


We can now formulate a general approach to data assimilation based on Bayes' formula.

Definition 1.10 (Bayes' data assimilation). Bayes' data assimilation determines
probability density functions $\pi_k^{(a)}$ at time $t_k$ for the states $\varphi \in X$
given data $f_k \in Y$ at time $t_k$ by cycling the following propagation and analysis
steps:

(i) Propagation Step. Calculate the prior density $\pi_k^{(b)}(\varphi)$ at time $t_k$ by
propagating the analysis density $\pi_{k-1}^{(a)}$ from time $t_{k-1}$ to $t_k$ based on
the (linear or nonlinear) model dynamics $M_{k-1}$.

(ii) Analysis Step. Calculate the posterior or analysis density
$\pi_k^{(a)}(\varphi | f_k)$ at time $t_k$ by Bayes' formula (1.40) using the measurement
model (1.41).


An important special case of Bayes' formula is the setup where all densities are
normal or Gaussian distributions. For the prior distribution, we assume that it is
a multivariate Gaussian distribution, that is, the probability density function is given
by

\[
\pi(\varphi) = \frac{1}{\sqrt{(2\pi)^n \det(B)}} \,
  e^{-\frac{1}{2} (\varphi - \mu)^T B^{-1} (\varphi - \mu)}, \quad \varphi \in \mathbb{R}^n, \tag{1.42}
\]

around some state $\mu := \varphi^{(b)} \in X = \mathbb{R}^n$ with some symmetric positive
definite matrix $B$. Gaussian densities are completely determined by their mean value
$\mu = E(\varphi) \in \mathbb{R}^n$ and the matrix $B$, which is well known to be the
covariance matrix, that is,

\[
B = E\big((\varphi - \mu)(\varphi - \mu)^T\big), \tag{1.43}
\]

of the Gaussian distribution (1.42). We write $\varphi \sim \mathcal{N}(\mu, B)$. The
normalization is based on the integral formula

\[
\int_{\mathbb{R}^n} e^{-\frac{1}{2} \varphi^T B^{-1} \varphi} \, d\varphi
  = \sqrt{\frac{(2\pi)^n}{\det(B^{-1})}} = \sqrt{(2\pi)^n \det(B)}.
\]


Let us study the case where the probability density $\pi(f | \varphi)$ of the
measurements $f$ is also given by a Gaussian distribution with probability density function

\[
\pi(f | \varphi) = \frac{1}{\sqrt{(2\pi)^m \det(R)}} \,
  e^{-\frac{1}{2} (f - H(\varphi))^T R^{-1} (f - H(\varphi))}, \quad f \in \mathbb{R}^m, \tag{1.44}
\]

around the values $H(\varphi) \in Y = \mathbb{R}^m$ with the symmetric positive definite
covariance matrix $R \in \mathbb{R}^{m \times m}$ of the observation error. Then,
according to Bayes' formula (1.40), we obtain

\[
\pi(\varphi | f) \propto \exp\Big\{ -\frac{1}{2} \Big( (\varphi - \mu)^T B^{-1} (\varphi - \mu)
  + (f - H(\varphi))^T R^{-1} (f - H(\varphi)) \Big) \Big\}
\]

for the probability density function of the posterior distribution. If $H$ is linear, this
is again a normal distribution with probability density

\[
\pi(\varphi | f) \propto \exp\Big\{ -\frac{1}{2} (\varphi - \hat{\mu})^T \hat{B}^{-1} (\varphi - \hat{\mu}) \Big\}.
\]

Using $\mu = \varphi^{(b)}$, its mean $\hat{\mu}$ is given by

\[
\hat{\mu} = \varphi^{(b)} + B H^* (R + H B H^*)^{-1} (f - H \varphi^{(b)})
          = \varphi^{(b)} + K (f - H \varphi^{(b)}), \tag{1.45}
\]

and its covariance matrix $\hat{B}$ is given by

\[
\hat{B} = (B^{-1} + H^* R^{-1} H)^{-1} = (I - K H) B, \tag{1.46}
\]

where $K = B H^* (R + H B H^*)^{-1}$ is called the (Kalman) gain. The proof of (1.45) and
(1.46) will be worked out in detail in Section 7 on the Kalman filter, see equations
(1.77) and (1.79). The equivalence of the two different expressions in (1.46) can also
be obtained via the Sherman-Morrison-Woodbury formula (see, for example, [31]),
though here it is worked out elementarily in Lemma 1.16. We summarize the above
arguments in the following theorem.


Theorem 1.11 (Bayes' data assimilation for Gaussian probability densities). In the
case of a linear observation operator $H$, assume that the prior distribution is Gaussian
with probability density function $\pi(\varphi)$ and the same is true for the distribution
of the measurements with probability density function $\pi(f | \varphi)$ as given in
(1.44). Then, the posterior distribution with density function $\pi(\varphi | f)$ is
Gaussian as well. Its mean is calculated by the update formula (1.45) and its covariance
matrix is given by (1.46).


Note that the update formula (1.45) for the mean of the posterior Gaussian distribution
is the same as for the update vector (or reconstruction) $\varphi^{(a)}$ obtained from
(cycled) Tikhonov regularization (1.31), which is equivalent to 3DVar. In this respect,
we see that Bayes' data assimilation gives more information by calculating a whole
probability distribution of a state estimate, whereas Tikhonov regularization/3DVar
only provides the mean of the estimate.
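
The Gaussian update of Theorem 1.11 can be written down directly. The following is a minimal sketch, assuming real matrices so that the adjoint $H^*$ is the transpose; it computes the posterior mean and covariance once in the gain form (1.45), (1.46) and once in the information form based on $(B^{-1} + H^* R^{-1} H)^{-1}$, so that the two can be compared. The function names and the small test matrices are illustrative assumptions.

```python
import numpy as np

def gaussian_update(mu, B, H, R, f):
    """Posterior mean and covariance in gain form, cf. (1.45) and (1.46)."""
    K = B @ H.T @ np.linalg.inv(R + H @ B @ H.T)   # Kalman gain
    mu_post = mu + K @ (f - H @ mu)
    B_post = (np.eye(len(mu)) - K @ H) @ B
    return mu_post, B_post

def gaussian_update_information_form(mu, B, H, R, f):
    """Same posterior via B_post = (B^{-1} + H^T R^{-1} H)^{-1}."""
    B_inv, R_inv = np.linalg.inv(B), np.linalg.inv(R)
    B_post = np.linalg.inv(B_inv + H.T @ R_inv @ H)
    mu_post = B_post @ (B_inv @ mu + H.T @ R_inv @ f)
    return mu_post, B_post

# Hypothetical two-dimensional state with one observed component.
mu = np.array([0.0, 0.0])
B = np.array([[2.0, 0.5], [0.5, 1.0]])
H = np.array([[1.0, 0.0]])
R = np.array([[0.5]])
f = np.array([1.0])

mean_gain, cov_gain = gaussian_update(mu, B, H, R, f)
mean_info, cov_info = gaussian_update_information_form(mu, B, H, R, f)
print(mean_gain, mean_info)
```

The gain form inverts a matrix of the size of the data space, while the information form inverts matrices of the size of the state space; which one is preferable depends on the dimensions $m$ and $n$.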

Further, when the dynamics $M$ of a dynamical system is linear, then it maps
a Gaussian distribution into a Gaussian distribution. The covariance matrix $B$ in (1.45)
and (1.46) needs to be replaced by its transported version $B^{(b)}$ calculated from the
matrix $B$ at the previous assimilation step by $B^{(b)} := M B M^*$. This propagation
formula for $B^{(b)}$ arises from the definition of the covariance matrix (1.43) and the
linearity of the expected value. In this case, we can formulate the full cycling of the
Bayesian approach explicitly.


Definition 1.12 (Gaussian Bayes' data assimilation for linear systems). For linear
dynamical systems $M_k$ and linear observation operators $H_k$, we start with some prior
distribution with probability density function $\pi_0^{(a)}(\varphi)$ given by its mean
$\varphi_0^{(a)}$ and its covariance matrix $B_0^{(a)}$. Then, for $k = 1, 2, 3, \dots$,
we carry out Bayes' data assimilation by cycling the following propagation and analysis
steps.

(i) Propagation Step. Calculate the mean state $\varphi_k^{(b)}$ and the covariance
matrix $B_k^{(b)}$ of the prior density $\pi_k^{(b)}(\varphi)$ at time $t_k$ by

\[
\varphi_k^{(b)} = M_{k-1} \varphi_{k-1}^{(a)}, \qquad
B_k^{(b)} = M_{k-1} B_{k-1}^{(a)} M_{k-1}^*. \tag{1.47}
\]

(ii) Analysis Step. Calculate the Gaussian posterior or analysis density
$\pi_k^{(a)}(\varphi | f_k)$ at time $t_k$ by its mean and covariance

\[
\varphi_k^{(a)} = \varphi_k^{(b)} + B_k^{(b)} H_k^* \big(R + H_k B_k^{(b)} H_k^*\big)^{-1} \big(f_k - H_k \varphi_k^{(b)}\big), \tag{1.48}
\]

\[
\big(B_k^{(a)}\big)^{-1} = \big(B_k^{(b)}\big)^{-1} + H_k^* R^{-1} H_k. \tag{1.49}
\]


The above calculations treat the case of linear systems. Of course, Bayes’ formula 
also works for nonlinear dynamics and nonlinear observation operators for which 
the numerics is much more difficult to carry out efficiently. A numerical method to 
approximately calculate the densities by ensemble approaches will be introduced in 
Section 8. 


6 4DVar 


A natural approach to the solution of a time-dependent state estimation problem is to
put all available measurements into one big minimization problem. Given measurements
$f_{k+1}, \dots, f_{k+K} \in Y$, this leads to

\[
J_k(\varphi) := \big\| \varphi - \varphi_k^{(b)} \big\|^2
  + \sum_{j=1}^{K} \big\| f_{k+j} - H M_{k+j,k}(\varphi) \big\|^2, \tag{1.50}
\]

where $M_{k+j,k}$ is defined in (1.2). For simplicity, we use a fixed (possibly nonlinear)
observation operator $H$. Similar to the approach in Section 1, we can rewrite the problem
(1.50) in a 3DVar type form like (1.21) by putting all the measurements
$f_{k+1}, \dots, f_{k+K}$ into one long vector, removing the sum and defining a new
(possibly nonlinear) operator $\mathbf{H}_k$, that is,

\[
J_k(\varphi) = \big\| \varphi - \varphi_k^{(b)} \big\|^2
  + \big\| \mathbf{f}_k - \mathbf{H}_k(\varphi) \big\|^2,
\]

where

\[
\mathbf{f}_k = \begin{pmatrix} f_{k+1} \\ f_{k+2} \\ \vdots \\ f_{k+K} \end{pmatrix}
\quad \text{and} \quad
\mathbf{H}_k = \begin{pmatrix} H M_{k+1,k} \\ H M_{k+2,k} \\ \vdots \\ H M_{k+K,k} \end{pmatrix}.
\]

The minimization of (1.50) corresponds to the fit of the full dynamic trajectory of the
states to the given measurements $f_{k+j}$, $j = 1, \dots, K$, over the time window between
$t_k$ and $t_{k+K}$. As in Section 3, we can transform the functional (1.50) into
a (generally nonlinear) Tikhonov functional of the form (1.15); see, for example,
[28, 45]. Note that sometimes the observation $f_k$ at time step $t_k$ is included in the
sum (here, in the functional (1.50), it is not included).

Denote the minimum of (1.50) by $\varphi_k^{(a)}$. A cycling of the assimilation is then
obtained by using a new background at time $t_{k+K}$ defined by

\[
\varphi_{k+K}^{(b)} = M_{k+K,k}\big(\varphi_k^{(a)}\big), \tag{1.51}
\]

for $k = 0, K, 2K, 3K, \dots$. The process of minimizing the functional (1.50) and using
the minimizer $\varphi_k^{(a)}$ as the initial condition for the forecast is known as
four-dimensional variational data assimilation (4DVar) [6, 19, 50, 51, 72]. The repeated
minimization of (1.50) combined with (1.51) is then a cycled 4DVar scheme. As we can
write 4DVar in the form of 3DVar, this is merely a form of (nonlinear) cycled Tikhonov
regularization as shown in Section 3.
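
The rewriting of the 4DVar functional (1.50) as a 3DVar type functional with one stacked operator can be sketched for linear dynamics as follows. This is a minimal sketch, assuming that the model steps are given as a list of matrices and that unweighted Euclidean norms are used; the names `stacked_operator`, `cost_sum_form` and `cost_stacked_form` as well as the random test data are illustrative assumptions.

```python
import numpy as np

def stacked_operator(H, M_steps):
    """Stack the blocks H M_{k+j,k}, j = 1..K, for linear step matrices M_k, ..., M_{k+K-1}."""
    M_acc = np.eye(M_steps[0].shape[0])
    blocks = []
    for M in M_steps:
        M_acc = M @ M_acc              # M_{k+j,k} = M_{k+j-1} ... M_k
        blocks.append(H @ M_acc)
    return np.vstack(blocks)

def cost_sum_form(phi, phi_b, H, M_steps, data):
    """Functional (1.50): background term plus a sum over the time window."""
    cost = np.sum((phi - phi_b) ** 2)
    state = phi
    for M, f in zip(M_steps, data):
        state = M @ state
        cost += np.sum((f - H @ state) ** 2)
    return cost

def cost_stacked_form(phi, phi_b, H_stack, f_stack):
    """Same functional in 3DVar type form with one long data vector."""
    return np.sum((phi - phi_b) ** 2) + np.sum((f_stack - H_stack @ phi) ** 2)

rng = np.random.default_rng(1)
n, m, K = 4, 2, 3
H = rng.standard_normal((m, n))
M_steps = [rng.standard_normal((n, n)) for _ in range(K)]
data = [rng.standard_normal(m) for _ in range(K)]
phi_b, phi = rng.standard_normal(n), rng.standard_normal(n)

H_stack = stacked_operator(H, M_steps)
f_stack = np.concatenate(data)
J_sum = cost_sum_form(phi, phi_b, H, M_steps, data)
J_stacked = cost_stacked_form(phi, phi_b, H_stack, f_stack)
print(J_sum, J_stacked)
```

Both evaluations give the same value of the functional, which is why results for 3DVar and cycled Tikhonov regularization carry over to 4DVar.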

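Usually, the minimization of (1.50) is carried out by a gradient method, that is, we
calculate the gradient $\nabla_\varphi J_k(\varphi)|_{\varphi^{(\ell)}}$ at points
$\varphi^{(\ell)}$ in the state space and update

\[
\varphi^{(\ell+1)} = \varphi^{(\ell)} - h \, \nabla_\varphi J_k(\varphi)\big|_{\varphi^{(\ell)}} \tag{1.52}
\]

with some appropriately chosen step-size $h > 0$ and starting guess $\varphi^{(0)}$ (often
$\varphi^{(0)} := \varphi_k^{(b)}$ is used).

For simplicity, we consider the case where $X = \mathbb{R}^n$ and $Y = \mathbb{R}^m$, and
the scalar products are the $\ell^2$ scalar products. Let us study terms of the form

\[
g(\varphi) := \| f - H M \varphi \|_Y^2, \tag{1.53}
\]

with $f \in Y$ and some linear operator $M : X \to X$. The gradient of $g(\varphi)$ with
respect to $\varphi$ is given by

\[
\nabla_\varphi g(\varphi) = -2 \big( M^* H^* (f - H M \varphi) \big). \tag{1.54}
\]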


If $M$ is a nonlinear operator, then we obtain the nonlinear version

\[
\nabla_\varphi g(\varphi) = -2 \left( \frac{dM(\varphi)}{d\varphi} \right)^* H^* \big(f - H M(\varphi)\big) \tag{1.55}
\]

of (1.54), where $dM(\varphi)/d\varphi$ denotes the Fréchet derivative of $M(\varphi)$ with
respect to $\varphi$. The derivative

\[
M'(\varphi) := \frac{dM(\varphi)}{d\varphi} \tag{1.56}
\]

is also known as the tangent linear model [26, 50].

For many applications, the dynamical model is given as a system of ordinary
differential equations in the form

\[
\dot{\varphi} = F(\varphi), \qquad \varphi(0) = \varphi_0. \tag{1.57}
\]


Since the model dynamics is given by $\varphi(t) = M_{t,0}(\varphi(0)) = M_{t,0}(\varphi_0)$,
this means that

\[
F(\varphi) = \frac{d}{dt} M_{t,0}(\varphi_0). \tag{1.58}
\]

We denote the derivative with respect to the initial state $\varphi_0$ by

\[
\varphi'(t) := \frac{d\varphi(t)}{d\varphi_0}. \tag{1.59}
\]

Note that $\varphi'$ is a linear mapping from $X$ into $X$; when $X = \mathbb{R}^n$, it is
the $n \times n$ matrix with elements $\partial \varphi_j / \partial \varphi_{0,i}$ for
$i, j = 1, \dots, n$.

We assume that the solution $\varphi = \varphi(t)$ is continuously differentiable with
respect to the initial state $\varphi_0$ as well as with respect to the time $t$. In this
case, we can exchange the differentiation with respect to time $t$ and the initial state
$\varphi_0$ and, differentiating (1.57) with respect to $\varphi_0$, we obtain

\[
\frac{dF(\varphi)}{d\varphi_0}
  = \frac{d}{d\varphi_0} \frac{d}{dt} M_{t,0}(\varphi_0)
  = \frac{d}{dt} \frac{d}{d\varphi_0} \varphi(t)
  = \frac{d}{dt} \varphi'(t). \tag{1.60}
\]

Therefore, the time evolution of the derivative $\varphi'$ is given by

\[
\frac{d}{dt} \varphi'(t) = \frac{dF(\varphi(t))}{d\varphi_0}
  = F'(\varphi(t)) \, \frac{d\varphi(t)}{d\varphi_0}. \tag{1.61}
\]

At time $t = 0$, this is equal to
$F'(\varphi_0) = dF(\varphi)/d\varphi \,|_{\varphi = \varphi_0}$; that is,

\[
\frac{d}{dt} \varphi'(t) \Big|_{t=0} = F'(\varphi_0). \tag{1.62}
\]

This means that the tangent linear model $\varphi'$ can be calculated by solving the system

\[
\frac{d}{dt} \varphi'(t) = F'(\varphi(t)) \, \varphi'(t), \qquad t > 0, \tag{1.63}
\]

of ordinary differential equations with initial condition $\varphi'(0) = I$ and with the
solution $\varphi$ of the original system of equations (1.57). Using
$\varphi(t) = M_{t,0}(\varphi_0)$ and $\varphi'(t) = dM_{t,0}(\varphi_0)/d\varphi_0$ as
well as (1.56), we obtain

\[
\varphi'(t) = \frac{dM_{t,0}(\varphi_0)}{d\varphi_0} =: M'_{t,0}(\varphi_0)
\]

for the tangent linear model.

We remark that the tangent linear adjoint is an $n \times n$ matrix which might be
huge when $n$ is large. Thus, efficient methods for its evaluation need to be set up. To
evaluate the adjoint in (1.54), we define a function $\psi(t) \in X$ on the interval
$[t_k, t_{k+1}]$ by

\[
\dot{\psi}(t) = -F'(\varphi(t))^* \, \psi(t), \tag{1.64}
\]

with final condition

\[
\psi(t_{k+1}) = H^* \big(f_{k+1} - H M(\varphi_k)\big). \tag{1.65}
\]

Lemma 1.13. For $t \in [t_k, t_{k+1}]$, the inner product

\[
h(t) := \big\langle \varphi'(t)(\delta\varphi_0), \psi(t) \big\rangle
\]

is constant over time for any $\delta\varphi_0 \in X$.


22 —— Melina A. Freitag and Roland W. E. Potthast 


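Proof. We differentiate $h(t)$ with respect to $t$ and calculate

\[
\begin{aligned}
\frac{dh(t)}{dt}
 &= \frac{d}{dt} \big\langle \varphi'(t)(\delta\varphi_0), \psi(t) \big\rangle \\
 &= \Big\langle \frac{d}{dt} \varphi'(t)(\delta\varphi_0), \psi(t) \Big\rangle
  + \Big\langle \varphi'(t)(\delta\varphi_0), \frac{d}{dt} \psi(t) \Big\rangle \\
 &= \big\langle F'(\varphi(t)) \varphi'(t)(\delta\varphi_0), \psi(t) \big\rangle
  + \big\langle \varphi'(t)(\delta\varphi_0), -F'(\varphi(t))^* \psi(t) \big\rangle \\
 &= \big\langle \varphi'(t)(\delta\varphi_0), F'(\varphi(t))^* \psi(t) \big\rangle
  - \big\langle \varphi'(t)(\delta\varphi_0), F'(\varphi(t))^* \psi(t) \big\rangle \\
 &= 0,
\end{aligned}
\tag{1.66}
\]

where we have used (1.63) and (1.64). Since the derivative of $h(t)$ is zero by (1.66), we
obtain the statement of the lemma.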


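Let $e_j$, $j = 1, \dots, n$, be the canonical basis of $\mathbb{R}^n$. We can now
calculate the gradient $\nabla g$ of (1.54) by

\[
\begin{aligned}
(\nabla g)_j(\varphi_k)
 &= -2 \big\langle \varphi'(t_{k+1}) e_j, H^* \big(f_{k+1} - H M(\varphi_k)\big) \big\rangle \\
 &= -2 \big\langle \varphi'(t_{k+1}) e_j, \psi(t_{k+1}) \big\rangle \\
 &= -2 \big\langle \varphi'(t_k) e_j, \psi(t_k) \big\rangle \\
 &= -2 \big\langle e_j, \psi(t_k) \big\rangle = -2 \big(\psi(t_k)\big)_j
\end{aligned}
\tag{1.67}
\]

for $j = 1, \dots, n$. Thus, the gradient is calculated by propagating the field forward
in time by (1.57), then propagating the observation error back by (1.64), (1.65) and
calculating the gradient using (1.67).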
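
The forward and backward sweep described above can be sketched in discrete form. This is a minimal sketch under stated assumptions: the model is discretized by explicit Euler steps, the backward equation (1.64) is replaced by the exact adjoint of the discretized tangent linear steps, and the pendulum right-hand side `F`, the operator `H`, the datum `f` and the step size are hypothetical choices. The adjoint gradient is compared with central finite differences.

```python
import numpy as np

def F(phi):
    """Hypothetical right-hand side: a pendulum, phi = (angle, angular velocity)."""
    return np.array([phi[1], -np.sin(phi[0])])

def F_prime(phi):
    """Jacobian of F."""
    return np.array([[0.0, 1.0], [-np.cos(phi[0]), 0.0]])

def forward(phi0, dt, steps):
    """Explicit Euler integration; returns the whole trajectory."""
    traj = [np.asarray(phi0, dtype=float)]
    for _ in range(steps):
        traj.append(traj[-1] + dt * F(traj[-1]))
    return traj

def misfit(phi0, f, H, dt, steps):
    """g(phi0) = || f - H M(phi0) ||^2 for the discretized model M."""
    return float(np.sum((f - H @ forward(phi0, dt, steps)[-1]) ** 2))

def adjoint_gradient(phi0, f, H, dt, steps):
    """Forward sweep, then backward sweep of the adjoint variable psi."""
    traj = forward(phi0, dt, steps)
    psi = H.T @ (f - H @ traj[-1])               # final condition, cf. (1.65)
    for phi in reversed(traj[:-1]):
        psi = psi + dt * F_prime(phi).T @ psi    # adjoint of one Euler step
    return -2.0 * psi                            # cf. (1.67)

def finite_difference_gradient(phi0, f, H, dt, steps, eps=1e-6):
    """Central differences, used only as a check."""
    grad = np.zeros(len(phi0))
    for j in range(len(phi0)):
        e = np.zeros(len(phi0))
        e[j] = eps
        grad[j] = (misfit(phi0 + e, f, H, dt, steps)
                   - misfit(phi0 - e, f, H, dt, steps)) / (2.0 * eps)
    return grad

H = np.array([[1.0, 0.0]])     # only the angle is observed
f = np.array([0.3])
phi0 = np.array([1.0, 0.0])
g_adj = adjoint_gradient(phi0, f, H, dt=0.01, steps=100)
g_fd = finite_difference_gradient(phi0, f, H, dt=0.01, steps=100)
print(g_adj, g_fd)
```

The adjoint sweep needs one forward and one backward integration regardless of the state dimension, whereas the finite difference check needs two forward integrations per state component.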

In general, we consider the time step $t_k$ as the initial time step or, subsequently,
the intermediate time step, and thus (1.57) becomes

\[
\dot{\varphi} = F(\varphi), \qquad \varphi(0) = \varphi_k, \qquad \text{where } M_k := \varphi(t), \tag{1.68}
\]

and the derivative $\varphi'$ with respect to the initial state $\varphi_k$ is given by
$\varphi'(t) := d\varphi(t)/d\varphi_k$. Hence, discretizing (1.68) using, for example,
a simple finite difference between time steps $t_k$ and $t_{k+1}$ leads to

\[
\frac{\varphi_{k+1} - \varphi_k}{\Delta t} = F(\varphi_k), \tag{1.69}
\]

and therefore the discretized model operator $M_k$ from time step $t_k$ to time step
$t_{k+1}$ is given by

\[
\varphi_{k+1} = \varphi_k + \Delta t \, F(\varphi_k) =: M_k(\varphi_k) = M_{k+1,k}(\varphi_k).
\]

Moreover, discretizing (1.63) leads to

\[
\frac{\varphi'_{k+1} - \varphi'_k}{\Delta t} = F'(\varphi_k) \, \varphi'_k. \tag{1.70}
\]

Hence, using $\varphi'_k = d\varphi_k / d\varphi_k = I$, the (discretized) tangent linear
model is given by

\[
\varphi'_{k+1} = I + \Delta t \, F'(\varphi_k) =: M'_k(\varphi_k) = M'_{k+1,k}(\varphi_k)
  := \frac{dM_k}{d\varphi} \Big|_{\varphi_k} = \frac{dM_{k+1,k}}{d\varphi} \Big|_{\varphi_k},
\]

which can also be obtained by differentiating (1.69) with respect to $\varphi_k$. Note
that we can similarly find the (nonlinear) operators $M_{k+j,k}$ and their tangent linear
models $M'_{k+j,k}(\varphi_k) := \frac{dM_{k+j,k}}{d\varphi} \big|_{\varphi_k}$ for any
$j = 1, \dots, K$, and, by the chain rule applied to (1.2), it follows that

\[
M'_{k+j,k}(\varphi_k) = M'_{k+j,k+j-1} \cdots M'_{k+2,k+1} \, M'_{k+1,k}(\varphi_k).
\]
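
The chain rule product for the discretized tangent linear model can be sketched as follows. This is a minimal sketch, assuming explicit Euler steps as in (1.69) and a hypothetical pendulum right-hand side `F`; each factor $I + \Delta t F'(\varphi_i)$ is evaluated at the corresponding point of the nonlinear trajectory, and the product is compared with a finite difference Jacobian of the discretized model.

```python
import numpy as np

def F(phi):
    """Hypothetical right-hand side: a pendulum, phi = (angle, angular velocity)."""
    return np.array([phi[1], -np.sin(phi[0])])

def F_prime(phi):
    """Jacobian of F."""
    return np.array([[0.0, 1.0], [-np.cos(phi[0]), 0.0]])

def forward(phi_k, dt, steps):
    """Nonlinear model: repeated Euler steps phi_{i+1} = phi_i + dt F(phi_i)."""
    traj = [np.asarray(phi_k, dtype=float)]
    for _ in range(steps):
        traj.append(traj[-1] + dt * F(traj[-1]))
    return traj

def tangent_linear(traj, dt):
    """Product of the step Jacobians I + dt F'(phi_i) along the trajectory."""
    n = traj[0].size
    M_prime = np.eye(n)
    for phi in traj[:-1]:
        M_prime = (np.eye(n) + dt * F_prime(phi)) @ M_prime   # newest step on the left
    return M_prime

def finite_difference_jacobian(phi_k, dt, steps, eps=1e-6):
    """Central differences of the discretized model, used only as a check."""
    n = len(phi_k)
    jac = np.zeros((n, n))
    for i in range(n):
        e = np.zeros(n)
        e[i] = eps
        jac[:, i] = (forward(phi_k + e, dt, steps)[-1]
                     - forward(phi_k - e, dt, steps)[-1]) / (2.0 * eps)
    return jac

phi_k = np.array([1.0, 0.0])
traj = forward(phi_k, dt=0.01, steps=50)
M_prime = tangent_linear(traj, dt=0.01)
M_fd = finite_difference_jacobian(phi_k, dt=0.01, steps=50)
print(np.max(np.abs(M_prime - M_fd)))
```

Forming the full $n \times n$ product is only viable for small $n$; for large systems one applies the factors to individual vectors instead, which is the motivation for the adjoint approach.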


Studying the case $X = \mathbb{R}^n$ and $Y = \mathbb{R}^m$, and using the weighted scalar
products (1.25) and (1.26), we may compute the gradient $\nabla_\varphi J_k(\varphi)$ of
the full functional $J_k(\varphi)$ given in (1.50) by

\[
\nabla_\varphi J_k(\varphi) := 2 B^{-1} \big(\varphi - \varphi_k^{(b)}\big)
  - 2 \sum_{j=1}^{K} M'_{k+j,k}(\varphi)^* H^* R^{-1} \big(f_{k+j} - H M_{k+j,k}(\varphi)\big). \tag{1.71}
\]

A gradient method like (1.52) can then be used to obtain a local minimizer for the
functional $J_k(\varphi)$ in (1.50). Another method which may be used to find a local
minimum of $J_k(\varphi)$ in (1.50) is the Gauss-Newton method [21]. We solve
$\nabla_\varphi J_k(\varphi) = 0$ in order to find the minimum of (1.50) using Newton's
method, that is,

\[
\varphi^{(\ell+1)} = \varphi^{(\ell)}
  - \Big( \nabla \nabla_\varphi J_k(\varphi)\big|_{\varphi^{(\ell)}} \Big)^{-1}
    \nabla_\varphi J_k(\varphi)\big|_{\varphi^{(\ell)}},
\]

with some starting guess $\varphi^{(0)}$, where
$\nabla \nabla_\varphi J_k(\varphi)|_{\varphi^{(\ell)}}$ is the Jacobian of
$\nabla_\varphi J_k(\varphi)$ at $\varphi^{(\ell)}$, that is, the Hessian. Usually, the
starting guess $\varphi^{(0)} = \varphi_k^{(b)}$ is taken. Often, instead of the correct
Hessian $\nabla \nabla_\varphi J_k(\varphi)|_{\varphi^{(\ell)}}$, an approximate version is
used, neglecting terms involving the gradient of the tangent linear model, thereby leading
to a quasi-Newton method. The gradient method usually only gives linear convergence.
The Gauss-Newton method with approximate Hessian converges superlinearly for
well-posed problems and a sufficiently close starting guess. For linear observation
operators $H$ and linear model dynamics $M_k$, the Newton and Gauss-Newton method
are the same and any local minimizer of (1.50) is clearly also a global minimizer (see,
for example, [32]) and the convergence speed to the global minimum is quadratic.
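
For linear model dynamics and a linear observation operator, the gradient (1.71) and the Hessian are available in closed form, and the functional is quadratic, so a single Newton step from the background already reaches the minimizer. The following is a minimal sketch of this special case, assuming real matrices (adjoints are transposes) and randomly generated test data; the function names and all numerical values are illustrative assumptions.

```python
import numpy as np

def propagators(M_steps):
    """Matrices M_{k+j,k}, j = 1..K, built from the single-step matrices."""
    M_acc = np.eye(M_steps[0].shape[0])
    out = []
    for M in M_steps:
        M_acc = M @ M_acc
        out.append(M_acc)
    return out

def gradient(phi, phi_b, B_inv, R_inv, H, M_steps, data):
    """Gradient (1.71) for linear dynamics, where M'_{k+j,k} = M_{k+j,k}."""
    grad = 2.0 * B_inv @ (phi - phi_b)
    for M_j, f_j in zip(propagators(M_steps), data):
        grad -= 2.0 * M_j.T @ H.T @ R_inv @ (f_j - H @ M_j @ phi)
    return grad

def hessian(B_inv, R_inv, H, M_steps):
    """Hessian of the quadratic functional (independent of phi in the linear case)."""
    hess = 2.0 * B_inv
    for M_j in propagators(M_steps):
        hess = hess + 2.0 * M_j.T @ H.T @ R_inv @ H @ M_j
    return hess

def newton_step(phi, phi_b, B_inv, R_inv, H, M_steps, data):
    """One Newton update phi - Hessian^{-1} gradient."""
    return phi - np.linalg.solve(hessian(B_inv, R_inv, H, M_steps),
                                 gradient(phi, phi_b, B_inv, R_inv, H, M_steps, data))

rng = np.random.default_rng(2)
n, m, K = 3, 2, 2
H = rng.standard_normal((m, n))
M_steps = [np.eye(n) + 0.1 * rng.standard_normal((n, n)) for _ in range(K)]
data = [rng.standard_normal(m) for _ in range(K)]
phi_b = rng.standard_normal(n)
B_inv, R_inv = np.eye(n), 2.0 * np.eye(m)

grad_norm_start = np.linalg.norm(gradient(phi_b, phi_b, B_inv, R_inv, H, M_steps, data))
phi_new = newton_step(phi_b, phi_b, B_inv, R_inv, H, M_steps, data)
grad_norm_end = np.linalg.norm(gradient(phi_new, phi_b, B_inv, R_inv, H, M_steps, data))
print(grad_norm_start, grad_norm_end)
```

In the nonlinear case the same structure applies with the tangent linear models evaluated at the current iterate, and several iterations are needed.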


7 Kalman filter and Kalman smoother 


The Kalman filter is a method to solve the data assimilation problem (1.3) similarly
to the cycled Tikhonov regularization, 3DVar or 4DVar. But in addition to calculating
an analysis in every step, it also iteratively updates the norm of the state space to
include the knowledge from previous assimilation cycles.

We can introduce the Kalman filter using deterministic and stochastic arguments.
Here, we will start with a deterministic approach, which also proves equivalence of
the Kalman filter and Kalman smoother to the four-dimensional variational data
assimilation for linear model dynamics $M_k : X \to X$ and linear observation operators
$H : X \to Y$. Then, we discuss a stochastic approach to the Kalman filter.

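Let us study assimilation for a linear model dynamics $M_k$, a linear observation
operator $H$ and measurements $f_1$ and $f_2$ at times $t_1$ and $t_2$. Then,
four-dimensional variational data assimilation with weighted norms as in Section 3
minimizes the functional (1.50)

\[
J_{4DVar}(\varphi) = \big\| \varphi - \varphi_0^{(b)} \big\|_{B^{-1}}^2
  + \| f_1 - H M_0 \varphi \|_{R^{-1}}^2
  + \| f_2 - H M_1 M_0 \varphi \|_{R^{-1}}^2, \tag{1.72}
\]

with $B \in \mathbb{R}^{n \times n}$ and $R \in \mathbb{R}^{m \times m}$. Alternatively, we
study the assimilation of the data $f_1$ in a first step by minimization of

\[
J_1(\varphi) = \big\| \varphi - \varphi_0^{(b)} \big\|_{B^{-1}}^2
  + \| f_1 - H M_0 \varphi \|_{R^{-1}}^2 \tag{1.73}
\]

with minimizer $\tilde{\varphi}^{(a)}$ and the assimilation of $f_2$ in a second step by
minimizing

\[
J_2(\varphi) = \big\| \varphi - \tilde{\varphi}^{(a)} \big\|_{\tilde{B}^{-1}}^2
  + \| f_2 - H M_1 M_0 \varphi \|_{R^{-1}}^2, \tag{1.74}
\]

with a weight matrix $\tilde{B}$. The key question here is to determine the new weight
$\tilde{B}$ such that the minimizer of $J_2$ is equal to the minimizer of the full
functional $J_{4DVar}$ in (1.72). This is the case if we can choose $\tilde{B}$ such that
$J_2(\varphi) = J_{4DVar}(\varphi) + c$ with some constant $c$, where $J_1$ is implicitly
used via $\tilde{\varphi}^{(a)}$ in (1.74). The problem is solved if we can determine
$\tilde{\varphi}^{(a)}$ and $\tilde{B}$ such that $J_1$ and the first term of $J_2$ are
identical up to a constant.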
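Starting with $J_1$, we obtain

\[
\begin{aligned}
J_1(\varphi) &= \big\langle \varphi - \varphi_0^{(b)}, B^{-1} \big(\varphi - \varphi_0^{(b)}\big) \big\rangle
  + \big\langle f_1 - H M_0 \varphi, R^{-1} (f_1 - H M_0 \varphi) \big\rangle \\
 &= \big\langle \varphi, \big(B^{-1} + M_0^* H^* R^{-1} H M_0\big) \varphi \big\rangle
  - 2 \big\langle \varphi, B^{-1} \varphi_0^{(b)} + M_0^* H^* R^{-1} f_1 \big\rangle + c,
\end{aligned}
\tag{1.75}
\]

with some constant $c$ independent of $\varphi$. The first term of $J_2$ is given by

\[
\big\| \varphi - \tilde{\varphi}^{(a)} \big\|_{\tilde{B}^{-1}}^2
  = \big\langle \varphi, \tilde{B}^{-1} \varphi \big\rangle
  - 2 \big\langle \varphi, \tilde{B}^{-1} \tilde{\varphi}^{(a)} \big\rangle + \tilde{c}, \tag{1.76}
\]

with some constant $\tilde{c}$ not depending on $\varphi$. A comparison of the coefficients
of the quadratic and linear terms in (1.75) and (1.76) immediately shows that with

\[
\tilde{B}^{-1} := B^{-1} + M_0^* H^* R^{-1} H M_0 \tag{1.77}
\]

and

\[
\tilde{B}^{-1} \tilde{\varphi}^{(a)} := B^{-1} \varphi_0^{(b)} + M_0^* H^* R^{-1} f_1 \tag{1.78}
\]

the functional $J_1$ given by (1.75) and the first term of the functional $J_2$ given by
(1.76) are the same up to some constant not depending on $\varphi$. Finally, from (1.78)
using (1.77), we derive

\[
\begin{aligned}
\tilde{\varphi}^{(a)} &= \big(B^{-1} + M_0^* H^* R^{-1} H M_0\big)^{-1}
   \big(B^{-1} \varphi_0^{(b)} + M_0^* H^* R^{-1} f_1\big) \\
 &= \big(I + B M_0^* H^* R^{-1} H M_0\big)^{-1}
   \big(\varphi_0^{(b)} + B M_0^* H^* R^{-1} f_1\big).
\end{aligned}
\tag{1.79}
\]

After some algebraic manipulations inserting

\[
I = \big(I + B M_0^* H^* R^{-1} H M_0\big) - B M_0^* H^* R^{-1} H M_0,
\]

we obtain

\[
\begin{aligned}
\tilde{\varphi}^{(a)} &= \varphi_0^{(b)} + \big(I + B M_0^* H^* R^{-1} H M_0\big)^{-1}
   B M_0^* H^* R^{-1} \big(f_1 - H M_0 \varphi_0^{(b)}\big) \\
 &= \varphi_0^{(b)} + B M_0^* H^* \big(R + H M_0 B M_0^* H^*\big)^{-1}
   \big(f_1 - H M_0 \varphi_0^{(b)}\big),
\end{aligned}
\]

which is the minimizer of $J_1$ as in (1.29) or (1.31) when the propagation $M_0$ from
$\varphi_0$ at time $t_0$ to $\varphi_1$ at time $t_1$, that is, $\varphi_1 = M_0 \varphi_0$,
is used. The above approach can be carried out successively for the measurements
$f_1, f_2, f_3$ etc. This sequential approach leads to the Kalman smoother (see, for
example, [27, 53, 59]). We will see later in Theorem 1.18 that the Kalman smoother is
equivalent to the Kalman filter at the final time.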


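Definition 1.14 (Kalman smoother (KS)). Let $H_k : X \to Y$ and $M_k : X \to X$,
$k = 0, 1, 2, \dots$, given in Definition 1.1 be linear and assume that measurements
$f_1, f_2, \dots$ at times $t_1, t_2, \dots$ are given. Then, we calculate weight matrices

\[
B_k^{-1} := B_{k-1}^{-1} + M_{k,0}^* H_k^* R^{-1} H_k M_{k,0}, \qquad k = 1, 2, \dots, \tag{1.80}
\]

with $B_0 := B$, where $M_{k,0}$ is defined in (1.2), and analysis states
$\tilde{\varphi}_k^{(a)}$ at time $t_k$ defined by

\[
\tilde{\varphi}_k^{(a)} := \tilde{\varphi}_{k-1}^{(a)}
  + B_{k-1} M_{k,0}^* H_k^* \big(R + H_k M_{k,0} B_{k-1} M_{k,0}^* H_k^*\big)^{-1}
    \big(f_k - H_k M_{k,0} \tilde{\varphi}_{k-1}^{(a)}\big) \tag{1.81}
\]

for $k = 1, 2, \dots$ with $\tilde{\varphi}_0^{(a)} := \varphi_0^{(b)}$.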


From our derivation, it is clear that the following theorem holds.

Theorem 1.15 (Equivalence of 4DVar and Kalman smoother). Let $H_k$ and $M_k$ for
$k = 0, 1, 2, \dots$ be linear operators and data $f_1, f_2, \dots$ be given. Then, 4DVar
carried out with data $f_1, \dots, f_k$ is equivalent to the Kalman smoother given in
Definition 1.14 in the sense that the minimum of the 4DVar functional taking $k = 0$ and
$K = k$ in (1.50) is given by the analysis $\tilde{\varphi}_k^{(a)}$ for
$k = 1, 2, \dots, K$ according to (1.81).

Proof. The proof for $k = 1$ is given in equations (1.72) to (1.79). The general case is
directly obtained by iterating the arguments.
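
The equivalence stated in Theorem 1.15 can be checked numerically for a small linear system. This is a minimal sketch, assuming real matrices, a fixed observation operator $H$ and randomly generated test data; `kalman_smoother` implements the recursion (1.80), (1.81) for states at time $t_0$, and `fourdvar_minimizer` solves the normal equations of the weighted 4DVar functional directly. All names and numbers are illustrative assumptions.

```python
import numpy as np

def kalman_smoother(phi_b, B, R, H, M_steps, data):
    """Sequential update of the state at time t_0, cf. (1.80) and (1.81)."""
    phi, B_k = phi_b.copy(), B.copy()
    M_k0 = np.eye(len(phi_b))
    R_inv = np.linalg.inv(R)
    for M, f in zip(M_steps, data):
        M_k0 = M @ M_k0                              # M_{k,0}
        G = H @ M_k0                                 # H_k M_{k,0}
        gain = B_k @ G.T @ np.linalg.inv(R + G @ B_k @ G.T)   # uses B_{k-1}
        phi = phi + gain @ (f - G @ phi)             # (1.81)
        B_k = np.linalg.inv(np.linalg.inv(B_k) + G.T @ R_inv @ G)   # (1.80)
    return phi

def fourdvar_minimizer(phi_b, B, R, H, M_steps, data):
    """Minimizer of the weighted 4DVar functional via its normal equations."""
    B_inv, R_inv = np.linalg.inv(B), np.linalg.inv(R)
    A = B_inv.copy()
    rhs = B_inv @ phi_b
    M_k0 = np.eye(len(phi_b))
    for M, f in zip(M_steps, data):
        M_k0 = M @ M_k0
        G = H @ M_k0
        A += G.T @ R_inv @ G
        rhs += G.T @ R_inv @ f
    return np.linalg.solve(A, rhs)

def random_spd(k, rng):
    """Hypothetical helper: a random symmetric positive definite matrix."""
    A = rng.standard_normal((k, k))
    return A @ A.T + k * np.eye(k)

rng = np.random.default_rng(3)
n, m = 3, 2
H = rng.standard_normal((m, n))
M_steps = [np.eye(n) + 0.2 * rng.standard_normal((n, n)) for _ in range(2)]
data = [rng.standard_normal(m) for _ in range(2)]
B, R = random_spd(n, rng), random_spd(m, rng)
phi_b = rng.standard_normal(n)

phi_ks = kalman_smoother(phi_b, B, R, H, M_steps, data)
phi_4d = fourdvar_minimizer(phi_b, B, R, H, M_steps, data)
print(np.max(np.abs(phi_ks - phi_4d)))
```

The sequential scheme processes one measurement at a time and only keeps the current state and weight matrix, whereas the direct solve needs all data of the window at once.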


In Definition 1.14, we worked with states at time $t_0$. Usually, the states of the
Kalman filter are calculated at times $t_1$, $t_2$ etc. We need to propagate the states
$\tilde{\varphi}_k^{(a)}$ from time $t_0$ to $t_k$ by

\[
\varphi_k^{(b)} = M_{k,0} \tilde{\varphi}_{k-1}^{(a)}, \quad \text{and} \quad
\varphi_k^{(a)} = M_{k,0} \tilde{\varphi}_k^{(a)}, \tag{1.82}
\]

for $k = 1, 2, 3, \dots$, which means that

\[
\varphi_k^{(b)} = M_{k-1} \big(\varphi_{k-1}^{(a)}\big) \tag{1.83}
\]

propagates the state from $t_{k-1}$ to $t_k$ (see also (1.20)). The matrices $B_k$ are
propagated from $t_0$ to $t_k$ by

\[
B_k^{(b)} = M_{k,0} B_{k-1} M_{k,0}^*, \quad \text{and} \quad
B_k^{(a)} = M_{k,0} B_k M_{k,0}^*, \tag{1.84}
\]

for $k = 1, 2, 3, \dots$, where the background matrix at time $t_k$ is obtained by
propagating the analysis matrix from time $t_{k-1}$ to $t_k$ by

\[
B_k^{(b)} = M_{k-1} B_{k-1}^{(a)} M_{k-1}^*. \tag{1.85}
\]

Note that the propagation of the state (1.83) and the propagation of the weight matrix
(1.85) are equivalent to the propagation step in Bayes' data assimilation for Gaussian
probability densities and linear systems, see (1.47).

Using (1.82) and (1.84), the iterative version of (1.81) is then given by

\[
\varphi_k^{(a)} = \varphi_k^{(b)} + B_k^{(b)} H_k^* \big(R + H_k B_k^{(b)} H_k^*\big)^{-1}
  \big(f_k - H_k \varphi_k^{(b)}\big), \tag{1.86}
\]

for $k \in \mathbb{N}$, often written in the form

\[
\varphi_k^{(a)} = \varphi_k^{(b)} + K_k \big(f_k - H_k \varphi_k^{(b)}\big) \tag{1.87}
\]

with the Kalman gain matrix

\[
K_k := B_k^{(b)} H_k^* \big(R + H_k B_k^{(b)} H_k^*\big)^{-1}. \tag{1.88}
\]

Note that the Kalman gain matrix is identical to the Tikhonov regularization matrix
in (1.31). Using (1.85) and (1.80), we readily verify that the analysis matrix
$B_k^{(a)}$ at time $t_k$ is obtained from the background matrix $B_k^{(b)}$ at time $t_k$
by

\[
\big(B_k^{(a)}\big)^{-1} = \big(B_k^{(b)}\big)^{-1} + H_k^* R^{-1} H_k, \tag{1.89}
\]

for $k \in \mathbb{N}$. Note that the analysis matrix $B_k^{(a)}$ in (1.89) and the
analysis state $\varphi_k^{(a)}$ in (1.86) are equivalent to the updated covariance matrix
and the updated state in the analysis step in Bayes' data assimilation for Gaussian
probability densities and linear systems, see (1.48) and (1.49).

Often, another version of (1.89) is used, where the matrices appear without their
inverse (see also (1.46)).


Lemma 1.16. For $k \in \mathbb{N}$ and $B_k^{(a)}$ in (1.89), we have

$$B_k^{(a)} = (I - K_k H_k)\, B_k^{(b)}, \tag{1.90}$$

where $K_k$ is given by (1.88).
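Proof. We start from (1.89) in the form

$$B_k^{(a)} = \bigl(I + B_k^{(b)} H_k^* R^{-1} H_k\bigr)^{-1} B_k^{(b)}. \tag{1.91}$$

We expand

$$\begin{aligned}
T :={}& \bigl(I + B_k^{(b)} H_k^* R^{-1} H_k\bigr)(I - K_k H_k) \\
={}& \bigl(I + B_k^{(b)} H_k^* R^{-1} H_k\bigr)\Bigl(I - B_k^{(b)} H_k^* \bigl(R + H_k B_k^{(b)} H_k^*\bigr)^{-1} H_k\Bigr) \\
={}& I + \underbrace{B_k^{(b)} H_k^* R^{-1} H_k}_{=:S} - \underbrace{B_k^{(b)} H_k^* \bigl(R + H_k B_k^{(b)} H_k^*\bigr)^{-1} H_k}_{=:S_1} \\
& - \underbrace{B_k^{(b)} H_k^* R^{-1} H_k B_k^{(b)} H_k^* \bigl(R + H_k B_k^{(b)} H_k^*\bigr)^{-1} H_k}_{=:S_2}
\end{aligned} \tag{1.92}$$

and remark that

$$S = B_k^{(b)} H_k^* R^{-1} \bigl(R + H_k B_k^{(b)} H_k^*\bigr)\bigl(R + H_k B_k^{(b)} H_k^*\bigr)^{-1} H_k = S_1 + S_2,$$

yielding $T = I$. Thus,

$$\bigl(I + B_k^{(b)} H_k^* R^{-1} H_k\bigr)^{-1} = (I - K_k H_k)$$

and the proof is complete.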
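The identity between the inverse form (1.89) and the gain form (1.90) can also be checked numerically. The following is a minimal sketch with NumPy; the dimensions, the random matrices and all variable names are illustrative assumptions and do not come from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 6, 3                       # state and observation dimensions (illustrative)

A = rng.standard_normal((n, n))
B_b = A @ A.T + n * np.eye(n)     # symmetric positive definite background matrix
H = rng.standard_normal((m, n))   # linear observation operator
R = 0.5 * np.eye(m)               # observation error covariance

# Kalman gain as in (1.88)
K = B_b @ H.T @ np.linalg.inv(R + H @ B_b @ H.T)

# Analysis matrix from the inverse form (1.89) and from the gain form (1.90)
B_a_inverse_form = np.linalg.inv(np.linalg.inv(B_b) + H.T @ np.linalg.inv(R) @ H)
B_a_gain_form = (np.eye(n) - K @ H) @ B_b

max_difference = np.max(np.abs(B_a_inverse_form - B_a_gain_form))
print(max_difference)
```

The printed difference should be at the level of rounding errors. The gain form avoids inverting the $n \times n$ matrices and only requires the inverse of a matrix of the size of the observation space.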


We are now ready to define the Kalman filter (see, for example, [2, 39, 53]). 


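Definition 1.17 (Kalman filter). Starting with an initial state $\varphi_0^{(a)}$ and an initial weight
matrix $B_0^{(a)} := B$, for $k \in \mathbb{N}$, the Kalman filter iteratively calculates an analysis $\varphi_k^{(a)}$
at time $t_k$ for k = 1,2,... by
(i) propagating the state $\varphi_{k-1}^{(a)}$ from $t_{k-1}$ to $t_k$ via (1.83):

$$\varphi_k^{(b)} = M_{k-1}\bigl(\varphi_{k-1}^{(a)}\bigr),$$

(ii) propagating $B_{k-1}^{(a)}$ from $t_{k-1}$ to $t_k$ following (1.85):

$$B_k^{(b)} = M_{k-1} B_{k-1}^{(a)} M_{k-1}^*,$$

(iii) calculating the Kalman gain by (1.88):

$$K_k = B_k^{(b)} H_k^* \bigl(R + H_k B_k^{(b)} H_k^*\bigr)^{-1},$$

(iv) calculating an analysis state by (1.86):

$$\varphi_k^{(a)} = \varphi_k^{(b)} + K_k \bigl(f_k - H_k \varphi_k^{(b)}\bigr),$$

(v) calculating the analysis weight matrix by (1.90):

$$B_k^{(a)} = (I - K_k H_k)\, B_k^{(b)}.$$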


The first two steps of the Kalman filter are often referred to as the predictor steps 
as they predict a state and a covariance estimate by propagating them forward via the 
model dynamics. The last two steps are called analysis steps; they update the state and
covariance estimate.
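The cycle of predictor and analysis steps can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions: the function name `kalman_filter_step`, the two-dimensional rotation model, the observation of the first component only and the noise-free data are all hypothetical choices made for the example.

```python
import numpy as np

def kalman_filter_step(phi_a_prev, B_a_prev, M, H, R, f):
    """One Kalman filter cycle for a linear model M and linear observation operator H."""
    phi_b = M @ phi_a_prev                                  # state propagation (1.83)
    B_b = M @ B_a_prev @ M.T                                # weight matrix propagation (1.85)
    K = B_b @ H.T @ np.linalg.inv(R + H @ B_b @ H.T)        # Kalman gain (1.88)
    phi_a = phi_b + K @ (f - H @ phi_b)                     # analysis state (1.86)
    B_a = (np.eye(B_b.shape[0]) - K @ H) @ B_b              # analysis weight matrix (1.90)
    return phi_a, B_a

# Illustrative setup (assumption): rotation dynamics, only the first component is observed.
theta = 0.1
M = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
H = np.array([[1.0, 0.0]])
R = np.array([[0.01]])

truth = np.array([1.0, 0.0])
phi_a, B_a = np.zeros(2), np.eye(2)      # initial state and initial weight matrix B
for k in range(50):
    truth = M @ truth
    f = H @ truth                        # noise-free observation of the true state
    phi_a, B_a = kalman_filter_step(phi_a, B_a, M, H, R, f)

final_error = np.linalg.norm(phi_a - truth)
print(final_error, np.trace(B_a))
```

Although only one component is observed, the model dynamics transport information to the unobserved component, so both the analysis error and the trace of the analysis matrix shrink over the cycles.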

The relationship between the Kalman filter, the Kalman smoother and 4DVar is 
summarized in the following theorem. 


Theorem 1.18 (Equivalence of 4DVar, Kalman filter and Kalman smoother). Let the
operators $H_k : X \to Y$ for $k \in \mathbb{N}$ and $M_k : X \to X$ for $k \in \mathbb{N}_0$ be linear. Let $\varphi_k^{(a)}$ be
the analysis of the Kalman filter at time $t_k$, $\varphi_k^{(s)}$ the analysis of the Kalman smoother
with data $f_1, \dots, f_k$ at time $t_0$, $\varphi^{(4D,k)}$ the minimizer of the 4DVar functional (1.50) at
time $t_0$ and define

$$\tilde{\varphi}^{(4D,k)} := M_{k,0}\,\varphi^{(4D,k)}, \quad k = 1,2,3,\dots \tag{1.93}$$

Then, 4DVar is equivalent to the Kalman filter and to the Kalman smoother in the sense
that

$$\tilde{\varphi}^{(4D,k)} = \varphi_k^{(a)} = M_{k,0}\,\varphi_k^{(s)}, \tag{1.94}$$

if we start the iterations with the same initial background state $\varphi_0^{(b)}$ and the same initial
error covariance matrix $B_0^{(b)} := B$.


Proof. The equivalence of the Kalman smoother with the Kalman filter is obtained by 
our reformulation based on (1.82) worked out in equations (1.85) to (1.90). The equiv- 
alence to 4DVar is then a consequence of Theorem 1.15. 


Theorem 1.18 states that the Kalman smoother is equivalent to the Kalman filter 
(and 4DVar) at the end of some time window for linear operators. 

We finally consider the stochastic approach to the Kalman filter, which we formu- 
late as a basic theorem. Observing that the formulas for Bayes’ data assimilation with 
Gaussian densities as given in Definition 1.12 are identical to the update formulas for 
the Kalman filter according to Definition 1.17, the proof of this result is straightfor- 
ward. 


Theorem 1.19 (Equivalence of Kalman filter and Bayes’ data assimilation). For linear
systems $M_k : X \to X$, linear observation operators $H_k : X \to Y$, and Gaussian probability
densities, the Kalman filter as given in Definition 1.17 is identical to Bayes’ data
assimilation given by Definition 1.12.
For nonlinear system dynamics $M_k : X \to X$ and nonlinear observation operators
$H_k : X \to Y$, the above equivalences do not hold any more. However, we may still
apply the Kalman filter if we linearize both the model $M_k$ and the observation operator
$H_k$ about the considered state. This leads to the Extended Kalman Filter (EKF)
[2, 46]. The linearizations of the model operator $M_k$ and the observation operator $H_k$,
which are used within the Kalman filter (Definition 1.17), are given by

$$\mathbf{H}_k(\varphi_k) := \frac{dH_k}{d\varphi}\Big|_{\varphi_k} \quad \text{and} \quad \mathbf{M}_k(\varphi_k) := \frac{dM_k}{d\varphi}\Big|_{\varphi_k},$$

where $\mathbf{M}_k$ is the tangent linear model (1.56).

We have introduced several data assimilation methods and shown that for linear 
systems, they are all essentially equivalent to cycled Tikhonov regularization with 
a weighted norm. In the next section, we consider ensemble methods which provide 
a way of (approximately) updating probability distributions and covariance matrices 
within the assimilation schemes. 


8 Ensemble methods 


We have introduced several methods for data assimilation in the previous sections, 
including Tikhonov data assimilation, 3DVar, 4DVar, Bayes’ data assimilation and the 
Kalman filter. 

Evaluating the different approaches, we note that 3DVar and Tikhonov data assimilation
work with fixed norms at every time step and do not fully include all the dynamic
information which is available from previous assimilations. Since 4DVar uses
full trajectories over some time window, it implicitly includes such information and
we can expect it to be superior to the simple 3DVar. However, Bayes’ data assimilation
and the Kalman filter are equivalent to 4DVar for linear systems and include all
available information by updating the weight matrices and propagating them through
time. This is essentially done implicitly in 4DVar. In general, we can expect them to
yield results comparable to those of 4DVar.

The need to propagate some probability distribution is a characteristic feature
of Bayes’ data assimilation and the Kalman filter. It is also their main challenge
since the matrices $B_k^{(a)}$ or $B_k^{(b)}$ have dimension $n \times n$, which for large $n$ is usually
not feasible in terms of computation time or storage, even when supercomputers are
employed for the calculation as in most operational centers for atmospheric data
assimilation. Thus, a key need for these methods is to formulate algorithms which give
a reasonable approximation to the weight matrices $B_k^{(a)}$ with less computational cost
than by the use of (1.85) and (1.89) or (1.90).

Often, the approach to ensemble methods is carried out via stochastic estimators. 
Here, we want to stay within the framework of the previous sections and study the 
ensemble approach from the viewpoint of applied mathematics. The stochastic view 
will be discussed in a second step. One of the most popular ensemble filter techniques
is the Ensemble Kalman filter [3, 11, 24, 25, 41-43, 65, 70, 77, 84]. 


Definition 1.20 (Ensemble). An ensemble with N members is any finite set of vectors
$\varphi^{(\ell)} \in X$ for $\ell = 1, \dots, N$. We can propagate the ensemble through time by applying
the model dynamics $M : X \to X$ or $M_k : X \to X$, respectively. Starting with an initial
ensemble $\varphi_0^{(\ell)}$, $\ell = 1, \dots, N$, this leads to ensemble members

$$\varphi_k^{(\ell)} = M_{k-1}\,\varphi_{k-1}^{(\ell)}, \quad k = 1,2,3,\dots \tag{1.95}$$

for $\ell = 1, \dots, N$.


We start with the construction of a particular family of ensembles generated by
the eigenvalue decomposition of the weight matrix $B := B_k^{(b)}$ defined in Section 7
with $X = \mathbb{R}^n$. $B$ is a self-adjoint and positive definite matrix; hence, there is a complete
set of eigenvectors of $B$, i.e. we have vectors $\psi^{(1)}, \dots, \psi^{(n)} \in X$ and eigenvalues
$\lambda^{(1)}, \dots, \lambda^{(n)}$ such that

$$B\psi^{(\ell)} = \lambda^{(\ell)} \psi^{(\ell)}, \quad \ell = 1, \dots, n. \tag{1.96}$$

The eigenvalues are real valued and positive and we will always assume that they
are ordered according to their size $\lambda^{(1)} \geq \lambda^{(2)} \geq \dots \geq \lambda^{(n)}$. With the matrix
$\Lambda := \operatorname{diag}\bigl(\sqrt{\lambda^{(1)}}, \dots, \sqrt{\lambda^{(n)}}\bigr)$ and the orthogonal matrix $U := [\psi^{(1)}, \dots, \psi^{(n)}]$, we obtain

$$B = U\Lambda^2 U^* = (U\Lambda)(U\Lambda)^*, \tag{1.97}$$


where we note that $U^* = U^{-1}$. This representation corresponds to the well-known
principal component analysis of the quadratic form defined by

$$E(\varphi, \psi) := \varphi^T B \psi, \quad \varphi, \psi \in X. \tag{1.98}$$

Geometrically, $B$ defines a hypersurface of second order with positive eigenvalues,
whose level sets form a family of $(n-1)$-dimensional ellipsoids in $X$. The principal
axes of these ellipsoids are given by the eigenvectors $\psi^{(\ell)}$, $\ell = 1, \dots, n$.

The application of $B$ to some vector $\varphi \in X$ according to (1.97) is carried out by
a projection of $\varphi$ onto the principal axes $\psi^{(\ell)}$ of $B$, followed by the multiplication
with $\lambda^{(\ell)}$. This setup can be a basis for further insight to construct a low-dimensional
approximation of $B$.

Before we continue the ensemble construction, we first need to discuss the metric
in which we want an approximation of the $B$-matrix. We remark that the role of $B$ in
the Kalman filter is mainly in the update formulas (1.85), (1.86) and (1.90). Here, to obtain
a good approximation of the vector updates in $L^2$, we need $B$ to be approximated
in the operator norm based on $L^2$ on $X = \mathbb{R}^n$. That is what we will use as the basis for
the following arguments.
Lemma 1.21. We construct an ensemble of vectors by choosing the $N - 1$ maximal eigenvalues
of $B$ and its corresponding eigenvectors $\psi^{(1)}, \dots, \psi^{(N-1)}$. We define

$$Q := \bigl[\sqrt{\lambda^{(1)}}\,\psi^{(1)}, \dots, \sqrt{\lambda^{(N-1)}}\,\psi^{(N-1)}\bigr]. \tag{1.99}$$

Then, we have the error estimate

$$\|B - QQ^*\| = \sup_{j = N, \dots, n} |\lambda^{(j)}| = |\lambda^{(N)}| = \lambda^{(N)}. \tag{1.100}$$

Proof. The proof is obtained from

$$B - QQ^* = U\tilde{\Lambda}^2 U^*, \tag{1.101}$$

with $\tilde{\Lambda}^2 = \operatorname{diag}[0, \dots, 0, \lambda^{(N)}, \lambda^{(N+1)}, \dots, \lambda^{(n)}]$, where there are $N - 1$ zeros on the
diagonal of $\tilde{\Lambda}^2$. Since $U$ is an orthogonal matrix, the norm estimate (1.100) is straightforward.


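We are now going to use arbitrary ensembles $\varphi^{(1)}, \dots, \varphi^{(N)}$ and construct approximate
weight matrices. From the Courant minimum-maximum principle, we
know that

$$\lambda^{(\ell)} = \min_{\dim U = \ell - 1}\ \max_{\varphi \perp U,\ \|\varphi\| = 1} \langle \varphi, B\varphi \rangle. \tag{1.102}$$

For an arbitrary ensemble $\varphi^{(1)}, \dots, \varphi^{(N)}$, we use the mean

$$\mu = \frac{1}{N} \sum_{\ell=1}^{N} \varphi^{(\ell)} \tag{1.103}$$

to define the ensemble matrix

$$Q := \bigl[\varphi^{(1)} - \mu, \dots, \varphi^{(N)} - \mu\bigr], \tag{1.104}$$

and we define the ensemble subspace $U_Q$ by

$$U_Q = \operatorname{span}\bigl\{\varphi^{(1)} - \mu, \dots, \varphi^{(N)} - \mu\bigr\}. \tag{1.105}$$

We call the vectors $\varphi^{(\ell)} - \mu$, $\ell = 1, \dots, N$ the centered ensemble. We remark that
$\dim U_Q = N - 1$. Since $Q^*\varphi = 0$ for every $\varphi \perp U_Q$, we have

$$\begin{aligned}
\|B - QQ^*\| &\geq \sup_{\varphi \perp U_Q,\ \|\varphi\| = 1} \|(B - QQ^*)\varphi\| \\
&\geq \sup_{\varphi \perp U_Q,\ \|\varphi\| = 1} \|B\varphi\| \\
&\geq \sup_{\varphi \perp U_Q,\ \|\varphi\| = 1} \langle \varphi, B\varphi \rangle \\
&\geq \min_{\dim U = N - 1}\ \sup_{\varphi \perp U,\ \|\varphi\| = 1} \langle \varphi, B\varphi \rangle \\
&= \lambda^{(N)}.
\end{aligned} \tag{1.106}$$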

The above results are summarized in the following theorem. 
Theorem 1.22. Let the eigenvalues $\lambda^{(1)} \geq \lambda^{(2)} \geq \dots \geq \lambda^{(n)}$ of the self-adjoint weight
matrix $B$ be ordered according to their size and let $\varphi^{(1)}, \dots, \varphi^{(N)}$ with $N \in \mathbb{N}$ be an arbitrary
ensemble of states in $X$. Then, the error for the approximation of the weight
matrix $B$ by $QQ^*$ with $Q$ defined in (1.104) is estimated by

$$\|B - QQ^*\|_2 \geq \lambda^{(N)}. \tag{1.107}$$

Remark 1.23. The optimal error $\lambda^{(N)}$ can be achieved if the centered ensemble spans
the space of the $N - 1$ eigenvectors $\psi^{(1)}, \dots, \psi^{(N-1)}$ of $B$ corresponding to the largest
eigenvalues $\lambda^{(1)}, \dots, \lambda^{(N-1)}$ with appropriate coefficients as in (1.99).
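The estimates (1.100) and (1.107) can be illustrated numerically. The sketch below builds a random symmetric positive definite matrix (an assumption made only for the example), forms the eigenvector ensemble of (1.99) and a random centered ensemble as in (1.104), and compares both approximation errors in the spectral norm with $\lambda^{(N)}$. All names and sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
n, N = 8, 4                                   # state dimension and ensemble size (illustrative)

A = rng.standard_normal((n, n))
B = A @ A.T + np.eye(n)                       # self-adjoint, positive definite weight matrix

lam, U = np.linalg.eigh(B)                    # eigh returns eigenvalues in ascending order
lam, U = lam[::-1], U[:, ::-1]                # reorder so that lam[0] >= lam[1] >= ...

# Eigenvector ensemble matrix as in (1.99): N - 1 scaled leading eigenvectors
Q_opt = U[:, :N - 1] * np.sqrt(lam[:N - 1])
err_opt = np.linalg.norm(B - Q_opt @ Q_opt.T, 2)

# Arbitrary ensemble, centered as in (1.103) and (1.104)
ensemble = rng.standard_normal((n, N))
Q_rand = ensemble - ensemble.mean(axis=1, keepdims=True)
err_rand = np.linalg.norm(B - Q_rand @ Q_rand.T, 2)

lambda_N = lam[N - 1]                         # lambda^(N) in the 1-based notation of the text
print(err_opt, lambda_N, err_rand)
```

The first error coincides with $\lambda^{(N)}$ up to rounding, while the random ensemble cannot do better, since $QQ^*$ has rank at most $N - 1$.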

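Ensembles can be used to approximate the weight matrix $B_{k+1}^{(b)}$ when the weight
matrix $B_k^{(a)}$ is given, see (1.85). If $B_k^{(a)}$ is approximated by the ensemble $\varphi_k^{(1)}, \dots, \varphi_k^{(N)}$
in the form

$$B_k^{(a)} \approx Q_k^{(a)} \bigl(Q_k^{(a)}\bigr)^*, \tag{1.108}$$

with $Q_k^{(a)} = \bigl[(\varphi_k^{(1)})^{(a)} - \mu, \dots, (\varphi_k^{(N)})^{(a)} - \mu\bigr]$, then by (1.85), we derive an approximation
for $B_{k+1}^{(b)}$ by

$$\begin{aligned}
B_{k+1}^{(b)} &= M_k B_k^{(a)} M_k^* \\
&\approx M_k Q_k^{(a)} \bigl(Q_k^{(a)}\bigr)^* M_k^* \\
&= \bigl(M_k Q_k^{(a)}\bigr)\bigl(M_k Q_k^{(a)}\bigr)^* \\
&= Q_{k+1}^{(b)} \bigl(Q_{k+1}^{(b)}\bigr)^*,
\end{aligned} \tag{1.109}$$

where $Q_{k+1}^{(b)} = M_k Q_k^{(a)}$.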
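Lemma 1.24. Consider the approximation of $B_k^{(a)}$ by an ensemble $\varphi_k^{(1)}, \dots, \varphi_k^{(N)}$ with
ensemble matrix $Q_k^{(a)}$. If the error satisfies

$$\bigl\| B_k^{(a)} - Q_k^{(a)} \bigl(Q_k^{(a)}\bigr)^* \bigr\| \leq \varepsilon, \tag{1.110}$$

for some $\varepsilon > 0$, then the error estimate for the propagated ensemble at time $t_{k+1}$ is
given by

$$\bigl\| B_{k+1}^{(b)} - Q_{k+1}^{(b)} \bigl(Q_{k+1}^{(b)}\bigr)^* \bigr\| \leq \|M_k\| \, \|M_k^*\| \, \varepsilon. \tag{1.111}$$

Proof. Based on (1.109), the proof is straightforward.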


A key question of ensemble methods is how to update the ensemble in the data
assimilation step. Given the data $f_k$ at time $t_k$, how do we get an ensemble which
approximates the analysis covariance matrix $B_k^{(a)}$ given an ensemble which approximates
the background error covariance matrix $B_k^{(b)}$? We know that for the Kalman
filter, the analysis covariance matrix $B_k^{(a)}$ is calculated from $B_k^{(b)}$ by (1.90). In terms of
the ensemble approximations, this means

$$Q_k^{(a)} \bigl(Q_k^{(a)}\bigr)^* = (I - K_k H_k)\, Q_k^{(b)} \bigl(Q_k^{(b)}\bigr)^* \tag{1.112}$$

with the ensemble Kalman matrix

$$K_k := Q_k^{(b)} \bigl(Q_k^{(b)}\bigr)^* H_k^* \Bigl(R + H_k Q_k^{(b)} \bigl(Q_k^{(b)}\bigr)^* H_k^*\Bigr)^{-1}, \tag{1.113}$$

leading to

$$Q_k^{(a)} \bigl(Q_k^{(a)}\bigr)^* = Q_k^{(b)} \underbrace{\Bigl\{ I - \bigl(Q_k^{(b)}\bigr)^* H_k^* \Bigl(R + H_k Q_k^{(b)} \bigl(Q_k^{(b)}\bigr)^* H_k^*\Bigr)^{-1} H_k Q_k^{(b)} \Bigr\}}_{=:T} \bigl(Q_k^{(b)}\bigr)^*. \tag{1.114}$$

The matrix $T$ in the curly brackets is self-adjoint and positive semidefinite, and hence
there exists a matrix $L$ such that $T = LL^*$. This finally leads to

$$Q_k^{(a)} = Q_k^{(b)} L, \tag{1.115}$$

which we denote as square root filter [4, 8, 65, 79].
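The square root update (1.115) can be sketched with NumPy as follows. The text only requires some factor $L$ with $T = LL^*$; this sketch uses the symmetric square root obtained from an eigendecomposition, which is one possible choice and an assumption of the example. The dimensions, the random ensemble and the observation operator are illustrative as well.

```python
import numpy as np

rng = np.random.default_rng(2)
n, m, N = 10, 4, 5                              # state dim, observation dim, ensemble size

ensemble = rng.standard_normal((n, N))
Q_b = ensemble - ensemble.mean(axis=1, keepdims=True)   # centered ensemble matrix, cf. (1.104)
H = rng.standard_normal((m, n))                 # linear observation operator (illustrative)
R = 0.1 * np.eye(m)

Y = H @ Q_b                                     # ensemble in observation space
S_inv = np.linalg.inv(R + Y @ Y.T)

# Matrix T from (1.114), symmetrized against rounding errors
T = np.eye(N) - Y.T @ S_inv @ Y
T = 0.5 * (T + T.T)

# Symmetric square root L with T = L L^T (one admissible choice of L)
w, V = np.linalg.eigh(T)
L = V @ np.diag(np.sqrt(np.clip(w, 0.0, None))) @ V.T

Q_a = Q_b @ L                                   # square root update (1.115)

# Consistency check against (1.112) with the ensemble Kalman matrix (1.113)
K_ens = Q_b @ Y.T @ S_inv
lhs = Q_a @ Q_a.T
rhs = (np.eye(n) - K_ens @ H) @ Q_b @ Q_b.T
print(np.max(np.abs(lhs - rhs)))
```

With the symmetric square root, the updated columns of `Q_a` still sum to zero, so the analysis ensemble remains centered; a Cholesky factor would satisfy (1.114) as well but would not in general have this property.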


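Lemma 1.25. Assume that $\varphi_k^{(1)}, \dots, \varphi_k^{(N)}$ is an ensemble which satisfies

$$\bigl\| B_k^{(b)} - Q_k^{(b)} \bigl(Q_k^{(b)}\bigr)^* \bigr\| \leq \varepsilon, \tag{1.116}$$

with some $\varepsilon < \|B_k^{(b)}\|$. Then, for the analysis ensemble defined by (1.115), we have

$$\bigl\| B_k^{(a)} - Q_k^{(a)} \bigl(Q_k^{(a)}\bigr)^* \bigr\| \leq C\varepsilon, \tag{1.117}$$

with some constant $C$ not depending on $Q_k^{(b)}$.

Proof. Using the notation $K_k^{(\mathrm{true})}$ for the Kalman gain matrix in the general case ((1.88)
and (1.90)), and $Q_k^{(a)} (Q_k^{(a)})^*$ from (1.112), we write

$$B_k^{(a)} - Q_k^{(a)} \bigl(Q_k^{(a)}\bigr)^* = \bigl(I - K_k^{(\mathrm{true})} H_k\bigr) \Bigl(B_k^{(b)} - Q_k^{(b)} \bigl(Q_k^{(b)}\bigr)^*\Bigr) + \bigl(K_k - K_k^{(\mathrm{true})}\bigr) H_k Q_k^{(b)} \bigl(Q_k^{(b)}\bigr)^*, \tag{1.118}$$

with $K_k$ defined by (1.113). We remark that due to its special structure, the norm of the
inverse $\bigl(R + H_k Q_k^{(b)} (Q_k^{(b)})^* H_k^*\bigr)^{-1}$ in (1.113) is bounded uniformly independent of
$Q_k^{(b)}$. Furthermore, using $\varepsilon < \|B_k^{(b)}\|$, the norm

$$\begin{aligned}
\bigl\| Q_k^{(b)} \bigl(Q_k^{(b)}\bigr)^* \bigr\| &= \Bigl\| B_k^{(b)} - \Bigl(B_k^{(b)} - Q_k^{(b)} \bigl(Q_k^{(b)}\bigr)^*\Bigr) \Bigr\| \\
&\leq \bigl\| B_k^{(b)} \bigr\| + \varepsilon \\
&\leq 2 \bigl\| B_k^{(b)} \bigr\|
\end{aligned} \tag{1.119}$$

is bounded uniformly, leading to

$$\bigl\| K_k^{(\mathrm{true})} - K_k \bigr\| \leq c\varepsilon, \tag{1.120}$$

with a constant $c$ not depending on $Q_k^{(b)}$. Finally, a similar estimate applied to (1.118)
yields the desired result (1.117) and the proof is complete.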


For further insight into ensemble methods, we refer to the article [69] in this book. 


9 Numerical examples 


We examine data assimilation techniques discussed in this article and their relation to 
inverse problem theory for simple model problems. First, we consider an advection- 
diffusion equation in Section 9.1 and then the Lorenz-95 system in Section 9.2. 


9.1 Data assimilation for an advection-diffusion system 


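Consider the following linear (one-dimensional) advection-diffusion problem (see,
for example, [15]). The system dynamics are described by

$$\frac{\partial}{\partial t} \varphi(x,t) = \nu \frac{\partial^2}{\partial x^2} \varphi(x,t) - a \frac{\partial}{\partial x} \varphi(x,t) \tag{1.121}$$

for $x \in (0,1)$ and $t \in (0,T)$. As boundary and initial conditions, we have

$$\begin{aligned}
\varphi(0,t) &= 0, \quad t \in (0,T), \\
\varphi(1,t) &= 0, \quad t \in (0,T), \\
\varphi(x,0) &= \varphi_0(x), \quad x \in (0,1).
\end{aligned}$$

Here, $\nu > 0$ is the diffusion coefficient and $a$ is the advection parameter. We want
to determine the initial condition $\varphi_0$ from the measurements of the solution $\varphi(x,t)$
at certain points in space and time. Let $0 = x_0 < x_1 < \dots < x_{n+1} = 1$ and $x_i = ih$,
$i = 0, \dots, n+1$ and $h = \frac{1}{n+1}$. With the discretizations of the spatial derivatives

$$\frac{\partial^2 \varphi}{\partial x^2} \approx \frac{\varphi^{i+1} - 2\varphi^i + \varphi^{i-1}}{h^2}, \qquad \frac{\partial \varphi}{\partial x} \approx \frac{\varphi^i - \varphi^{i-1}}{h},$$

for $i = 0, \dots, n$, we obtain a system of ordinary differential equations of the form

$$\varphi'(t) = F(\varphi), \quad t \in (0,T], \quad \varphi(0) = \varphi_0, \tag{1.122}$$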
where, in this case, $F(\varphi) = K\varphi(t)$, that is, $F$ is linear, with

$$K = \begin{pmatrix}
-\frac{2\nu}{h^2} - \frac{a}{h} & \frac{\nu}{h^2} & & & \\
\frac{\nu}{h^2} + \frac{a}{h} & -\frac{2\nu}{h^2} - \frac{a}{h} & \frac{\nu}{h^2} & & \\
 & \ddots & \ddots & \ddots & \\
 & & \frac{\nu}{h^2} + \frac{a}{h} & -\frac{2\nu}{h^2} - \frac{a}{h} & \frac{\nu}{h^2} \\
 & & & \frac{\nu}{h^2} + \frac{a}{h} & -\frac{2\nu}{h^2} - \frac{a}{h}
\end{pmatrix} \in \mathbb{R}^{(n+2) \times (n+2)}$$

and $\varphi(t) = [\varphi^0(t), \dots, \varphi^{n+1}(t)]^T \in \mathbb{R}^{n+2}$. To satisfy the boundary conditions, we
set $\varphi^0(t) = \varphi^{n+1}(t) = 0$ throughout. As an initial condition, we choose $\varphi^i(0) = \varphi_0(x_i)$.
The solution of the system of linear ordinary differential equations with constant
coefficients (1.122) is given by

$$\varphi(t) = (\exp Kt)\,\varphi_0, \quad t \in [0,T], \tag{1.123}$$

where $\exp Kt \in \mathbb{R}^{(n+2) \times (n+2)}$, or, using an explicit first-order Euler scheme, we obtain
the discrete linear model

$$\varphi_{k+1} = \varphi_k + \Delta t\, K \varphi_k, \quad k = 0, \dots, \frac{T}{\Delta t} - 1, \tag{1.124}$$

where $\varphi_k = [\varphi_k^0, \dots, \varphi_k^{n+1}]^T \in \mathbb{R}^{n+2}$ and $\varphi_k^0 = \varphi_k^{n+1} = 0$ throughout. Note that we
use a lower index to describe the time steps and an upper index to describe points in
space/components of $\varphi$. The approach (1.124) is a more practical implementation as
the analytical solution (1.123) would only be available for certain problems. We solve
the advection-diffusion problem (1.121) (using the forward Euler method) with $a = 1$,
$\nu = 0.01$, $n = 100$, final time $T = 0.5$, time step $\Delta t = 0.001$ and initial condition
$\varphi_0(x_i) = \sin(\pi x_i)$. The solution is shown in Figure 1.1.

Figure 1.1: Solution of $\varphi(t) = (\exp Kt)\,\varphi_0$, $t \in [0, 0.5]$ (discretized advection-diffusion equation
(1.121)) for initial condition $\varphi_0(x) = \sin(\pi x)$.
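The forward Euler model (1.124) with the parameters quoted above can be sketched as follows. As a simplifying assumption, the sketch works with the $n$ interior grid points only, since the two boundary values are fixed to zero; the function name `advection_diffusion_matrix` is hypothetical.

```python
import numpy as np

def advection_diffusion_matrix(n, nu, a):
    """Tridiagonal matrix K for the interior grid points (zero boundary values)."""
    h = 1.0 / (n + 1)
    return (np.diag(np.full(n, -2.0 * nu / h**2 - a / h))       # main diagonal
            + np.diag(np.full(n - 1, nu / h**2 + a / h), -1)    # sub-diagonal
            + np.diag(np.full(n - 1, nu / h**2), 1))            # super-diagonal

n, nu, a, dt, T = 100, 0.01, 1.0, 0.001, 0.5
x = np.arange(1, n + 1) / (n + 1)               # interior grid points x_1, ..., x_n
K = advection_diffusion_matrix(n, nu, a)
M = np.eye(n) + dt * K                          # one forward Euler step, cf. (1.124)

phi = np.sin(np.pi * x)                         # initial condition phi_0(x_i) = sin(pi x_i)
for _ in range(int(round(T / dt))):
    phi = M @ phi

print(np.linalg.norm(M, np.inf), phi.max())
```

For these parameters all entries of `M` are nonnegative and its rows sum to at most one, so the explicit scheme does not amplify the maximum norm of the state.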
For the inverse problem (data assimilation problem), we suppose we do not know
the initial condition $\varphi_0(x)$. We want to estimate $\varphi_0(x)$ from measurements of $r$
components $\varphi^{i_1}(t), \varphi^{i_2}(t), \dots, \varphi^{i_r}(t)$ of the solution $\varphi(t)$ at times $t_1 = 0.002$,
$t_2 = 0.004$, ..., $t_m = 0.5$. For our experiment, we use $r = 5$, and hence we observe 5
out of $n = 100$ components. Take noisy measurements of $H\varphi(t_1), H\varphi(t_2), \dots,
H\varphi(t_m)$, where $H \in \mathbb{R}^{r \times (n+2)}$ is the observation operator matrix (which is linear in
this case) given by $H_{\ell j} = 1$ if $j = i_\ell$ and $H_{\ell j} = 0$ otherwise. We obtain the (linear)
least squares problem

$$\min_{\varphi_0 \in \mathbb{R}^{n+2}} \|\mathbf{H}\varphi_0 - f\|_2^2 \tag{1.125}$$

with $\mathbf{H}$ and $f$ for the forward Euler method and observations every second time step
given by

$$\mathbf{H} = \begin{pmatrix} H(I + 2\Delta t K) \\ H(I + 2\Delta t K)^2 \\ \vdots \\ H(I + 2\Delta t K)^m \end{pmatrix} \in \mathbb{R}^{rm \times (n+2)} \quad \text{and} \quad f = \begin{pmatrix} f_1 \\ f_2 \\ \vdots \\ f_m \end{pmatrix} \in \mathbb{R}^{rm}.$$

The observations are obtained using the output from the exact initial condition and
the measurements usually contain noise (see Section 4 for a detailed description of
the errors), that is, $f = f^\delta = f^{(\mathrm{true})} + d^\delta$, where the noise is usually normally distributed,
that is, $d^\delta \sim N(0, \rho^2 I)$, where $\rho$ is the standard deviation. If we solve the
problem using a naive approach with a standard least squares implementation [74],
we obtain the result in Figure 1.2 (a).

Using the singular value decomposition given in Lemma 1.3, we have $\mathbf{H} = V\Sigma U^*$
and, with $f^\delta = f^{(\mathrm{true})} + d^\delta$, we obtain

$$\varphi_0 = \sum_{j=1}^{n+2} \frac{\langle f^\delta, v_j \rangle}{\sigma_j}\, u_j = \sum_{j=1}^{n+2} \left( \frac{v_j^T f^{(\mathrm{true})}}{\sigma_j} + \frac{v_j^T d^\delta}{\sigma_j} \right) u_j,$$

and clearly for small singular values $\sigma_j$, the noise is magnified, hence the naive solution
in Figure 1.2 (a). Figure 1.2 (b) shows what happens for this particular example.
The singular values $\sigma_j$ decay rapidly and only the coefficients $|v_j^T f| = |v_j^T f^\delta|$ above
the noise level (here we chose $d^\delta \sim N(0, \rho^2 I)$ with $\rho = 0.1$) are useful and carry
clear information about the data.

(a) Exact initial condition $\varphi_0$ and naive solution to the least squares problem.
(b) Plots of the singular values $\sigma_j$ and the coefficients $|v_j^T f^\delta|$, for $j = 1, \dots, n+2$.

Figure 1.2: Naive solution to the least squares problem (1.125) and singular values of $\mathbf{H}$.

In order to compute a better solution $\varphi_0$ for the initial condition than the one
given in Figure 1.2 (a), we apply Tikhonov regularization. From (1.31), the Tikhonov
regularized solution is given by

$$\varphi_0^{(a)} = \varphi_0^{(b)} + B\mathbf{H}^* \bigl(\alpha R + \mathbf{H} B \mathbf{H}^*\bigr)^{-1} \bigl(f - \mathbf{H}\varphi_0^{(b)}\bigr).$$


For our problem, we use the observation error covariance matrix $R = 0.01 I$ (in line
with the noise on the observations). For this particular problem, we chose $\varphi_0^{(b)} =
1 - 0.5\pi^2 (x - 0.5)^2$ for the background estimate, which is the truncated Taylor series
expansion of the true initial condition $\varphi_0$. For the background error covariance
matrix, we take $B$ with entries $B_{ij} = 0.01 \times \exp(\ldots)$ and for $\alpha$, we choose the value
$\alpha = 0.00359$ which minimizes the total error consisting of the perturbation error
$\|R_\alpha d^\delta\|$, where $R_\alpha = B\mathbf{H}^* (\alpha R + \mathbf{H} B \mathbf{H}^*)^{-1}$, and the regularization error
$\|R_\alpha \mathbf{H} \varphi_0 - \varphi_0\|$, see (1.19). The plots in Figure 1.3 show both the regularization
and perturbation error for this problem. For the value $\alpha = 0.00359$, the reconstruction
of the initial condition is plotted in Figure 1.4 (a) and the initial condition error is
displayed in Figure 1.4 (b). Note that similar computations can be done using no
background $\varphi_0^{(b)}$, the standard situation in Tikhonov regularization, different
background estimates, as well as different choices for the background error covariance
matrices $B$. For the choice of $\alpha$, which corresponds to the choice of the Tikhonov
regularization parameter, several heuristics are available, for example, the L-curve
criterion [36], generalized cross-validation [30] and the discrepancy principle [61],
where the latter is most appropriate for large scale computations.

Figure 1.3: Regularization/reconstruction and data/measurement error for different values of $\alpha$
between 0 and 0.015. The optimal $\alpha$ in this case is found to be $\alpha = 0.00359$.

Figure 1.4: Exact initial condition and regularized solution for the regularization parameter
$\alpha = 0.00359$ and the $\ell_2$-norm error between the exact and regularized solution for the linear
advection equation (1.121).

We have essentially solved a 4DVar data assimilation problem, since we have 
shown in Section 6 that 4DVar can be written in the form of 3DVar which is merely 
a Tikhonov regularization, discussed in Section 3. 
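The regularized reconstruction of the initial condition can be sketched as follows. Several ingredients are assumptions made for this illustration and are not taken from the text: the five observed sites, the correlation length in the matrix `B`, the random seed, and the use of two forward Euler steps $(I + \Delta t K)^2$ between observation times (which agrees with $I + 2\Delta t K$ up to terms of order $\Delta t^2$). The sketch again works on the interior grid points only.

```python
import numpy as np

n, nu, a, dt, T = 100, 0.01, 1.0, 0.001, 0.5
h = 1.0 / (n + 1)
x = np.arange(1, n + 1) * h
K = (np.diag(np.full(n, -2.0 * nu / h**2 - a / h))
     + np.diag(np.full(n - 1, nu / h**2 + a / h), -1)
     + np.diag(np.full(n - 1, nu / h**2), 1))
two_steps = np.linalg.matrix_power(np.eye(n) + dt * K, 2)   # model between observation times

# ASSUMPTION: five equally spaced observed sites (the text does not list them).
sites = np.array([9, 29, 49, 69, 89])
H = np.zeros((sites.size, n))
H[np.arange(sites.size), sites] = 1.0

# Stack the observation operator over all observation times t_1, ..., t_m
blocks, propagator = [], np.eye(n)
for _ in range(int(round(T / (2.0 * dt)))):
    propagator = two_steps @ propagator
    blocks.append(H @ propagator)
H_stack = np.vstack(blocks)

rng = np.random.default_rng(0)
phi_true = np.sin(np.pi * x)
f = H_stack @ phi_true + 0.1 * rng.standard_normal(H_stack.shape[0])   # noisy data, rho = 0.1

phi_b = 1.0 - 0.5 * np.pi**2 * (x - 0.5)**2          # background from the text
i = np.arange(n)
B = 0.01 * np.exp(-np.abs(i[:, None] - i[None, :]) / 10.0)   # ASSUMED correlation length 10
R = 0.01 * np.eye(H_stack.shape[0])
alpha = 0.00359

# Tikhonov regularized solution with background and weight matrix B
innovation = f - H_stack @ phi_b
phi_a = phi_b + B @ H_stack.T @ np.linalg.solve(alpha * R + H_stack @ B @ H_stack.T, innovation)

misfit_b = np.linalg.norm(f - H_stack @ phi_b)
misfit_a = np.linalg.norm(f - H_stack @ phi_a)
print(misfit_b, misfit_a, np.linalg.norm(phi_a - phi_true))
```

The data misfit of the analysis is never larger than that of the background, which follows from the structure of the update. The reconstruction error itself depends on the assumed sites and covariance, so the numbers printed here need not match the figures of the text.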

The situation described above was an ideal situation. In reality, models are non- 
linear and imperfect, that is, they include model error. We give examples for these 
situations. First, consider a nonlinear problem. Instead of (1.121), consider 


0 d? 


TAAGA ae 


and the discrete nonlinear problem becomes 


0 3 


T 
Prrı = Pk + AtKPk + Pk = Me(Mu), K=0...,7,- (1.126) 


We set up the nonlinear least squares problem

$$\min_{\varphi_0 \in \mathbb{R}^{n+2}} \|\mathbf{H}(\varphi_0) - f\|_2^2,$$

where here $\mathbf{H}$ is a nonlinear operator. The minimization problem can be solved using
the Gauss-Newton method [21, 64]. The results for the reconstructed initial condition
for the same data as for the linear problem are displayed in Figure 1.5 (a) and the
initial condition error is displayed in Figure 1.5 (b).

Figure 1.5: Exact initial condition and regularized solution for the regularization parameter $\alpha = 1$
and the $\ell_2$-norm error between exact and regularized solution for the nonlinear advection equation.

Finally, consider the case where some model error is present. To this end, we assume
that the observations are created by the true model for the nonlinear advection-diffusion
equation (1.121) with $a = 1$, $\nu = 0.01$. The model used in the data assimilation
process uses perturbed parameters $a^{\mathrm{pert}} = 1.1$, $\nu^{\mathrm{pert}} = 0.009$. The results for
the reconstructed initial condition are shown in Figure 1.6 (a) and the initial condition
error is displayed in Figure 1.6 (b). As the model contains an error, we are trying to fit
an initial condition for the wrong model and hence the error for this problem is rather
large, as seen in Figures 1.6 (a) and 1.6 (b).

However, in Figures 1.7 (a) and 1.7 (b), we see that this relatively large error in
the initial condition does not lead to large errors in the solution. Figure 1.7 (a) shows
the solution to the nonlinear advection equation with exact initial condition and
Figure 1.7 (b) shows the solution with the perturbed initial condition obtained after
solving the inverse (data assimilation) problem. We see that as the solution is propagated
forward in time, the error in the initial condition is smoothed. The reason is the
smoothing property of the forward operator. We have $\varphi_{k+1} = M_k(\varphi_k)$ where $M_k$ is
a linear (that is, $I + \Delta t K$) or a nonlinear (1.126) operator. If the state is
perturbed by $\zeta_k$, then we have $\varphi_{k+1} + \zeta_{k+1} = M_k(\varphi_k + \zeta_k)$, and to leading order

$$\zeta_{k+1} = \mathbf{M}_k(\varphi_k)\,\zeta_k,$$

where $\mathbf{M}_k$ is the discretized tangent linear model. Assuming that $\mathbf{M}_k(\varphi_k) = \mathbf{M}$ (which
holds for our linear example), we have $\zeta_k = \mathbf{M}^k \zeta_0$. From basic linear
algebra [31], we have that $\zeta_k \to 0$ if $\rho(\mathbf{M}) < 1$, where $\rho(\mathbf{M}) = \max\{|\lambda|, \lambda \in \Lambda(\mathbf{M})\}$
is the spectral radius. In our example, both for the linear and linearized nonlinear
model dynamics, the eigenvalues of $\mathbf{M}(\varphi_k)$ are within the unit circle, explaining
the smoothing of the error in the initial condition as the solution propagates in time.

Figure 1.6: Exact initial condition and regularized solution for the regularization parameter $\alpha = 1$
and the $\ell_2$-norm error between exact and regularized solution for the nonlinear advection equation
when a model error is present.

Figure 1.7: Solution to nonlinear advection-diffusion problem with exact and perturbed initial
condition.
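For the linear model $I + \Delta t K$ of Section 9.1, the damping of perturbations can be checked directly. The sketch below again uses the interior grid points only and relies on the standard closed-form eigenvalues of a tridiagonal Toeplitz matrix, $d + 2\sqrt{bc}\cos(j\pi/(n+1))$, rather than a numerical eigenvalue solver; this choice, the random perturbation and the variable names are assumptions of the example.

```python
import numpy as np

n, nu, a, dt = 100, 0.01, 1.0, 0.001
h = 1.0 / (n + 1)

# Entries of M = I + dt*K: diagonal d, sub-diagonal b, super-diagonal c
d = 1.0 + dt * (-2.0 * nu / h**2 - a / h)
b = dt * (nu / h**2 + a / h)
c = dt * (nu / h**2)

# Closed-form eigenvalues of a tridiagonal Toeplitz matrix (valid here since b*c > 0)
j = np.arange(1, n + 1)
eigenvalues = d + 2.0 * np.sqrt(b * c) * np.cos(j * np.pi / (n + 1))
spectral_radius = np.max(np.abs(eigenvalues))

# Propagate a random perturbation zeta_0 of the initial condition: zeta_k = M^k zeta_0
M = d * np.eye(n) + b * np.eye(n, k=-1) + c * np.eye(n, k=1)
rng = np.random.default_rng(0)
zeta = 0.1 * rng.standard_normal(n)
initial_size = np.linalg.norm(zeta)
for _ in range(500):
    zeta = M @ zeta
final_size = np.linalg.norm(zeta)

print(spectral_radius, initial_size, final_size)
```

The spectral radius is below one and the perturbation shrinks over the 500 time steps, in line with the smoothing behavior described above.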

In the next example, we consider problems which are more sensitive to the initial
conditions, that is, systems that exhibit chaotic dynamics (and hence more accurately
represent the effects in, say, weather forecasting). One such system is the Lorenz-95
model. In reality, we would expect a mix of situations arising from chaotic and
smoothing systems.


9.2 Data assimilation for the Lorenz-95 system 


As a second example, consider the Lorenz-95 system [55, 56], that is, a generalization
of the well-known three-dimensional Lorenz-63 system [54]. The model is given by
a system of $N$ coupled nonlinear ordinary differential equations whose solution $\varphi$
with components $\varphi = [\varphi^1, \dots, \varphi^N]$ satisfies

$$\frac{d\varphi^i}{dt} = -\varphi^{i-2}\varphi^{i-1} + \varphi^{i-1}\varphi^{i+1} - \varphi^i + f, \quad t \in (0,T], \quad \varphi^i(0) = \varphi_0^i, \tag{1.127}$$

where $i = 1, \dots, N$, with cyclic boundary conditions $\varphi^0 = \varphi^N$, $\varphi^{-1} = \varphi^{N-1}$,
$\varphi^{N+1} = \varphi^1$ and $f$ is a forcing term. For a forcing term $f = 8$, the system is chaotic
(i.e. it has positive Lyapunov exponents, see [76]). For $N = 40$, the system has 13
positive Lyapunov exponents. Lorenz [55] observed that this system has a similar error
growth characteristic as an operational numerical weather prediction system if
a time $T = 1$ is associated with 5 days.
We solve (1.127) using the classical 4th order explicit Runge-Kutta scheme, which
gives

$$\varphi_{k+1} = M_k(\varphi_k), \quad \text{where } \varphi_k = [\varphi_k^1, \dots, \varphi_k^N]^T, \tag{1.128}$$

and $M_k$ is the nonlinear model operator which propagates $\varphi_k$ to $\varphi_{k+1}$. The solution
trajectory of two components of $\varphi$ computed with the Runge-Kutta method, and
$\Delta t = 0.01$ and $T = 21$ is displayed in Figure 1.8. In order to illustrate the chaotic
dynamics of the Lorenz-95 model, we run it with slightly perturbed initial conditions.
Perturbing the initial condition randomly with an error of about 10% gives the ensemble
of forecasts in Figure 1.9 (a) and using a perturbation of about 0.1% gives the
forecast ensemble in Figure 1.9 (b). We only show the trajectory of site 20.
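A minimal sketch of the Lorenz-95 model (1.127) with the classical Runge-Kutta scheme is given below. The function names, the spin-up procedure, the size of the perturbation and the random seed are assumptions of this illustration and are not the setup used for the figures.

```python
import numpy as np

def lorenz95_rhs(phi, forcing=8.0):
    """Right-hand side of (1.127) with cyclic boundary conditions."""
    # np.roll(phi, 1)[i] = phi[i-1], np.roll(phi, 2)[i] = phi[i-2], np.roll(phi, -1)[i] = phi[i+1]
    return (np.roll(phi, -1) - np.roll(phi, 2)) * np.roll(phi, 1) - phi + forcing

def rk4_step(phi, dt, forcing=8.0):
    """One step of the classical 4th order Runge-Kutta scheme."""
    k1 = lorenz95_rhs(phi, forcing)
    k2 = lorenz95_rhs(phi + 0.5 * dt * k1, forcing)
    k3 = lorenz95_rhs(phi + 0.5 * dt * k2, forcing)
    k4 = lorenz95_rhs(phi + dt * k3, forcing)
    return phi + dt / 6.0 * (k1 + 2.0 * k2 + 2.0 * k3 + k4)

def integrate(phi, dt, steps):
    for _ in range(steps):
        phi = rk4_step(phi, dt)
    return phi

N, dt = 40, 0.01

# Spin-up (assumption): start near the unstable constant state and run onto the attractor
phi0 = np.full(N, 8.0)
phi0[19] += 0.01
phi0 = integrate(phi0, dt, 1000)

# Two forecasts from slightly different initial conditions
rng = np.random.default_rng(0)
perturbed0 = phi0 + 1e-6 * rng.standard_normal(N)
reference = integrate(phi0, dt, 1000)
perturbed = integrate(perturbed0, dt, 1000)

initial_gap = np.linalg.norm(perturbed0 - phi0)
final_gap = np.linalg.norm(perturbed - reference)
print(initial_gap, final_gap)
```

The distance between the two forecasts grows by many orders of magnitude over 1000 time steps, which is the sensitivity to initial conditions discussed in the text.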

The figures show an unperturbed solution trajectory and an ensemble where the
initial conditions have been slightly perturbed. It is easy to see that the larger the
perturbation in the initial condition, the more the error in the forecast grows. For this
problem, the eigenvalues of the matrix $\mathbf{M}_k(\varphi_k)$ from the linearization of (1.128) are
not necessarily within the unit disk.

Figure 1.8: Components 1 and 20 of the solution to (1.127).

Figure 1.9: Trajectory of site 20 of Lorenz-95 system of size 40. Green thick line: unperturbed
forecast. Black lines: Ensemble of 20 perturbed forecasts.

We carry out some data assimilation experiments with this problem. First, consider
the 4DVar minimization problem (1.50). We need to minimize

$$J(\varphi_0) := \bigl(\varphi_0 - \varphi_0^{(b)}\bigr)^T B^{-1} \bigl(\varphi_0 - \varphi_0^{(b)}\bigr) + \sum_{j=1}^{K} \bigl(f_j - H(\varphi_j)\bigr)^T R^{-1} \bigl(f_j - H(\varphi_j)\bigr), \tag{1.129}$$

where $\varphi_j = M_{j-1}(\varphi_{j-1})$ is given by (1.128). We have

$$\nabla_{\varphi_0} J(\varphi_0) = 2B^{-1} \bigl(\varphi_0 - \varphi_0^{(b)}\bigr) - 2 \sum_{j=1}^{K} \mathbf{M}_{j,0}(\varphi_0)^T H^T R^{-1} \bigl(f_j - H M_{j,0}(\varphi_0)\bigr),$$

where $M_{j,0}$ is given by (1.2) and $\mathbf{M}_{j,0}$ is the tangent linear model. In order to minimize
the cost function, we need $\nabla_{\varphi_0} J(\varphi_0) = 0$ and in order to solve this problem, we apply
Newton’s method. The Hessian (or the Jacobian for Newton’s method) is given by

$$\nabla\nabla_{\varphi_0} J(\varphi_0) = 2B^{-1} + 2 \sum_{j=1}^{K} \mathbf{M}_{j,0}(\varphi_0)^T H^T R^{-1} H \mathbf{M}_{j,0}(\varphi_0) + Q(\varphi_0),$$

where $Q(\varphi_0)$ involves terms including second derivatives of the system dynamics.
These are usually neglected since for large problems, they are inefficient, impracticable
and often infeasible to calculate. Hence, we solve

$$\begin{aligned}
\nabla\nabla_{\varphi_0} J\bigl(\varphi_0^{(\ell)}\bigr)\, \Delta\varphi_0^{(\ell)} &= -\nabla_{\varphi_0} J\bigl(\varphi_0^{(\ell)}\bigr), \\
\varphi_0^{(\ell+1)} &= \varphi_0^{(\ell)} + \Delta\varphi_0^{(\ell)},
\end{aligned}$$

for $\ell = 0, 1, \dots$, where $\varphi_0^{(\ell)}$ is the $\ell$th iterate of Newton’s method. For the initial
condition, the background state is usually chosen, that is, $\varphi_0^{(0)} = \varphi_0^{(b)}$. We perform
data assimilation for a single assimilation window of length 100 time steps, followed
by a forecast of 2000 time steps. First, we carry out an experiment with perfect
observations. For the background estimate, we choose a perturbed initial condition and
$B = 0.01 I$. Checking the singular values of the observability matrix for this problem,
we obtain that the singular values lie between 4 and 30, and the problem is not
ill-conditioned. This is in contrast to the problem in Section 9.1, where the forward
operator has very small singular values, which, however, led to a smoothing property
of the forecast. The problem here lies in the fact that the forecast error grows severely.
The inverse problem is not actually ill-conditioned as such, but the forward problem
exhibits severe error growth for small perturbations! Figure 1.10 shows the 1st
and 20th component of $\varphi$ before and after the data assimilation process. The error
between the true solution and the trajectory before and after the 4DVar data assimilation
process is shown in Figure 1.11. We observe that the error in the analysis (thick
line) is reduced significantly (compared to the background) in the first 600 time steps
(where the assimilation window is of length 100 time steps). After that, we see that
the effect of the chaotic dynamics emerges and the error grows since the initial condition
of the analysis vector is perturbed from the true initial condition. The initial
condition error is of order © (107°) at each of the sites. From Figure 1.9 (b), we cannot 
anticipate a better performance of the forecast. We expect the results to be best for
perfect and full observations. Next, we carry out an experiment with noisy observations.
The observations are generated from the truth with an error of mean zero and
covariance $R = 0.01 I$. Moreover, we only take observations every five time steps and
we only observe 8 of the 40 variables (precisely, we observe every 5th component).
For the background state, we use a perturbed initial condition, though this time with
background error covariance matrix B with entries B;; = 0.01 exp( Sa), We ob- 
serve that the singular values of the observability matrix for this problem lie between 


44 = Melina A. Freitag and Roland W. E. Potthast 


[Plot: components 1 and 20 against time step (0 to 2000); legend: truth, observations, initial guess, analysis.]

Figure 1.10: Components 1 and 20 of the solution to (1.127) for full and perfect observations. The plot 
shows the observations, the assimilation window, the exact trajectory, the background trajectory 
and the final solution (analysis) after 4DVar. 


0.02 and 7, and, not surprisingly, the problem is slightly worse conditioned than the 
one for full observations. 

Figure 1.12 shows the error between the true solution and the trajectory before 
and after the 4DVar data assimilation process. We observe that the error in both com- 
ponents is not reduced as much as the error in Figure 1.11 (for perfect and full obser- 
vations), which is to be expected as we observe fewer components and moreover, the 
observations are noisy. Note that with our setup, the 1st component is an “observed
site,” whereas the 20th component is unobserved. We can therefore explain the slightly
worse assimilation results of the trajectory of the 20th component compared to the 
trajectory of the 1st component in Figure 1.12. 

To explore this relation further, Figure 1.13 shows the absolute value of the error in
the initial condition for this problem, including the sites of the observations. Clearly,
at the observation sites, the analysis error is generally smaller than at the unobserved
sites. However, this is not always true as information about the true state from the
observations is spread to the unobserved sites through the coupling of the problem
and via the background error covariance matrix $B$.

Figure 1.11: Error of components 1 and 20 of the solution to (1.127) for full and perfect observations.
The plot shows the error in the background trajectory and the error in the final solution (analysis)
after 4DVar.

We carried out tests with other data assimilation algorithms such as 3DVar and 
the Extended Kalman Filter (EKF). We do not report the results for 3DVar here, but 
mention that for full perfect observations, 3DVar produces very small errors at the end 
of the assimilation window as we have perfect observations which are sequentially 
assimilated into the trajectory. Then, the forecast is run from a very small error at 
the end of the assimilation window. With fewer and noisy observations, 3DVar gives 
worse results than 4DVar (as in 4DVar, the missing information is assimilated via the 
system dynamics). Also, if a model error is included in the system dynamics (that is, 
the observations are created from the true trajectory, whereas in the data assimilation 
process, we use a different, perturbed model, replicating the practical situation), we 
obtain worse results than for the perfect model, as would be expected (Section 9.1). 

Figure 1.12: Error of components 1 and 20 of the solution to (1.127) for partial and noisy
observations. The plot shows the error in the background trajectory and the error in the final
solution (analysis) after 4DVar.

Figure 1.13: Error in the initial condition and observed sites for the solution to (1.127) after 4DVar for
partial and noisy observations.

Finally, we apply the EKF to the Lorenz-95 problem. If we use the same background
error covariance matrix and the same initial condition as for 4DVar, we obtain
essentially the same results as for 4DVar (as would be expected from Theorem 1.18).
The results here are only approximately equivalent, as Theorem 1.18 only holds for the
Kalman filter applied to linear system dynamics. However, when plotting the error,
we hardly observe any difference.

A better result than for 4DVar is obtained with the EKF if a better background error
covariance matrix is chosen. To this end, we use the covariance matrix produced by
the EKF (after one data assimilation cycle at time step 100) as the initial background
error covariance matrix for a new EKF experiment applied to the data assimilation
problem we consider. This should give a better (flow-dependent) background error
covariance matrix, which is indeed the case, as seen by comparing Figure 1.14 with Figure 1.12.
The new (flow-dependent) background covariance matrix can also be used for 4DVar, 
resulting in a hybrid method [9]. 
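
The idea of recycling a filter-produced covariance can be illustrated with a scalar toy problem. The sketch below is not the Lorenz-95 experiment reported here: the model, the observation values and all names are assumptions chosen only to show one EKF forecast/analysis cycle and the reuse of its output variance as the initial background variance of a second run.

```python
DT = 0.1


def model(x):
    """Toy nonlinear model (assumption): one Euler step of dx/dt = x (1 - x)."""
    return x + DT * x * (1.0 - x)


def model_jacobian(x):
    """Derivative of model() with respect to x (the tangent linear model)."""
    return 1.0 + DT * (1.0 - 2.0 * x)


def ekf_cycle(x_a, p_a, y, r_var, q_var=0.0):
    """One forecast/analysis cycle for a directly observed scalar state."""
    # Forecast step: propagate the state with the nonlinear model and the
    # error variance with the linearized model.
    m = model_jacobian(x_a)
    x_f = model(x_a)
    p_f = m * p_a * m + q_var
    # Analysis step: Kalman gain for the observation operator H = 1.
    gain = p_f / (p_f + r_var)
    x_new = x_f + gain * (y - x_f)
    p_new = (1.0 - gain) * p_f
    return x_new, p_new


def run_ekf(x0, p0, observations, r_var):
    """Cycle the filter through a sequence of observations."""
    x, p = x0, p0
    for y in observations:
        x, p = ekf_cycle(x, p, y, r_var)
    return x, p


obs = [0.22, 0.24, 0.25, 0.27]   # made-up observations
# First experiment: one cycle started from a static initial variance.
_, p_flow = run_ekf(0.3, 0.04, obs[:1], r_var=0.01)
# Second experiment: restart with the variance produced by the first cycle.
x_final, p_final = run_ekf(0.3, p_flow, obs, r_var=0.01)
```

In the scalar case the "flow dependence" reduces to the dependence of the propagated variance on the Jacobian along the trajectory; for a vector state the same recursion acts on the full covariance matrix.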

Figure 1.14: Error of components 1 and 20 of the solution to (1.127) for partial and noisy
observations. The plot shows the error in the background trajectory and the error in the final
solution (analysis) after applying the EKF.


10 Concluding remarks


Inverse problems are an area of research dealing with the reconstruction of functions 
or parameter distributions from measurements. It has evolved over nearly 100 years 
in many applications, for example, in electromagnetics and acoustics, in medical 
imaging and elastography. Today, a growing community of researchers employs both 
a large set of well-established methods for linear and nonlinear inverse problems as 
well as a large variety of specific new methods for reconstructions and imaging. 

Data assimilation has evolved as a very important and popular research area from 
specific applications such as weather prediction or hydrology. Using measurement 
data to control the evolution of dynamical systems shares many of the features which 
are integral parts of inverse problems. Since World War II, data assimilation has focused
on the state estimation problem, that is, the reconstruction of the state of the dynamical
system under consideration as an element of the particular state space X. Often, parameter
functions are also involved and lead to an extended state
space which includes unknown parameter functions as well. The algorithms which 
have been introduced here can easily be applied to this most general situation. 

Historically, the communities of inverse problems and data assimilation have 
evolved independently, with particular notation and approaches which are similar in 
content, but have been expressed in a different type of notation or terminology. One 
main goal of this article has been to describe key approaches to data assimilation 
in an inverse problems terminology, such that the dynamic inverse problems can be 
easily identified by the inverse problems community. At the same time, we provide
the data assimilation community with an introduction to a functional analytic view,
which is often given second priority by those working on important applications.

Today, the convergence of inverse problems and data assimilation is driven by the
evolution of modern remote sensing measurement technologies. For example, there
is an increasing set of satellite infrared and microwave sounders, whose assimilation
into atmospheric models involves the use of ill-posed measurement operators.
New radar systems measure not only Doppler shift and reflectivity of atmospheric
meteors, but also polarization. Ground-based LIDARs involve further highly
ill-posed measurement operators. Further techniques, for example, GPS/GNSS slant
delay measurements, lead to ill-posed tomographic problems which become integral
parts of operational data assimilation. We believe that the framework which we have
presented provides an adequate approach to the further development of these systems.

There is also a need for convergence on the level of assimilation algorithms. Clearly,
methods like 3DVar or 4DVar are essentially versions of Tikhonov regularization.
Additionally, modern ensemble or particle methods increase the need for mathematical
analysis with tools from functional analysis and approximation theory since, for
typical applications, only a very limited number of ensemble members or particles can be
used, and we are in the range of low-dimensional approximation theory rather than in the
stochastic limit of an infinite ensemble.

Our article has aimed to contribute to this convergence by presenting a concise
introduction to key algorithms and results in a functional analytic language which
has the potential to be understood by a large range of mathematicians, thus building 
a basis for further research and developments. We have included both the viewpoint 
of deterministic mathematics, numerical analysis and functional analysis as well as 
stochastics and Bayesian reasoning. Understanding important state-of-the-art algo- 
rithms within a uniform framework is a key step today to further develop the tools 
which are known to have the highest impact on society with respect to such crucial 
areas as high-impact weather, logistics, travel and energy supply by renewable energy 
resources. 


Variational data assimilation for very large
environmental problems 


Abstract: Variational data assimilation is commonly used in environmental forecast- 
ing to estimate the current state of the system from a model forecast and observation- 
al data. The assimilation problem can be written simply in the form of a nonlinear 
least squares optimization problem. However, the practical solution of the problem 
in large systems requires many careful choices to be made in the implementation. In 
this article, we present the theory of variational data assimilation and then discuss 
in detail how it is implemented in practice. Current solutions and open questions are 
discussed. 


1 Introduction


Data assimilation is the process of combining a numerical model forecast with obser- 
vational data in order to estimate the current state of a dynamical system. It has been 
an essential part of numerical weather prediction (NWP) since its beginnings in the 
1940s, when it was recognized that errors in the initial model state could rapidly lead 
to large errors in the forecast. Early data assimilation schemes were based on a sim- 
ple interpolation between the observations and the model state, with later schemes 
also taking account of the statistics of the errors in the data. Such schemes included 
smoothing splines, successive correction, optimal interpolation and analysis correc- 
tion [82, 85]. The possible use of methods based on variational calculus was proposed 
by Sasaki [103, 104] in the late 1950s and 1960s, but at the time a practical implementation
was not possible. A real breakthrough in the application of variational schemes
to NWP came in the late 1980s with a series of papers demonstrating how the problem
could be solved using techniques from the theory of optimal control, in particular
the use of adjoint equations to calculate the gradient of an objective function, or
cost function [77, 107]. This led to a series of papers in which the feasibility of variational
data assimilation was studied on a series of different simplified atmospheric
models [26, 93, 98, 108] (these experiments usually only included the large-scale atmospheric
dynamics and not the subgrid-scale processes of full weather prediction
models).

I would like to thank two anonymous reviewers whose detailed reading of the manuscript helped lead
to many improvements in the final version.

The author, Amos S. Lawless, is supported in part by the U.K. Natural Environment Research Council,
through the National Centre for Earth Observation.

Despite the encouraging results of these experiments, variational data assimila- 
tion remained impractical for operational use due to the high computational cost. 
The introduction of the incremental method of variational assimilation in 1994 [27], 
together with increasing computing power, opened up the possibility of an affordable 
implementation for operational weather prediction. Over the following decade, many 
weather forecasting centers began to develop variational data assimilation for oper- 
ational use [42, 43, 61, 84, 99, 100]. At the same time, variational data assimilation 
began to be applied to other applications, such as ocean forecasting [112, 116] and 
atmospheric chemistry [38]. 

A common feature of many of these applications is that the size of the state vari- 
able being estimated is extremely large. Current numerical weather prediction models 
may require the initialization of the order of $10^8$ variables in order to make a forecast.
As computing power increases, the spatial resolution of the models tends to increase 
and hence so does the number of variables being represented. Furthermore, the real- 
time nature of environmental forecasting requires that the data assimilation problem 
be solved quickly. These two factors imply that when implementing variational data 
assimilation schemes in practice, compromises must be made. Hence, it is important 
to design the algorithms carefully to ensure that as accurate a solution as possible is 
obtained within the time available. Ideally, such design should also include knowl- 
edge of the physics of the problem, so that the final solution is physically realistic. In 
the remainder of this article we will discuss some of the different choices that arise 
in the implementation of variational data assimilation for very large systems and the 
practical approaches that have been developed. First, we briefly present the mathe- 
matical theory of variational data assimilation. 


2 Theory of variational data assimilation

We consider a discrete nonlinear dynamical system given by the equation

$$\mathbf{x}_{i+1} = \mathcal{M}_i(\mathbf{x}_i), \tag{2.1}$$

where $\mathbf{x}_i \in \mathbb{R}^n$ is the state vector at time $t_i$ and $\mathcal{M}_i$ is the nonlinear model operator
that propagates the state at time $t_i$ to time $t_{i+1}$, for $i = 0, 1, \ldots, N-1$. We assume that
we have imperfect observations $\mathbf{y}_i \in \mathbb{R}^{p_i}$ at times $t_i$, $i = 0, \ldots, N$, that are related to
the system state through the equation

$$\mathbf{y}_i = \mathcal{H}_i(\mathbf{x}_i) + \boldsymbol{\epsilon}_i, \tag{2.2}$$

where $\mathcal{H}_i : \mathbb{R}^n \to \mathbb{R}^{p_i}$ is known as the observation operator and maps the state vector
to observation space. The observation errors $\boldsymbol{\epsilon}_i$ are usually assumed to be unbiased,
serially uncorrelated, Gaussian errors with known covariance matrices $\mathbf{R}_i$. For the numerical
weather prediction problem, the vector $\mathbf{x}_i$ would contain several meteorological
variables, for example, pressure, temperature and the three-dimensional wind at
each grid point of the model domain. The observation operator $\mathcal{H}_i$ may just be a simple
interpolation in space if the state variable is observed directly. However, it could
be a much more complicated nonlinear function of the state. For example, for a satellite
radiance measurement, the observation operator can include a complex radiative
transfer model.

We assume that at the initial time $t_0$ we have an a priori estimate of the state,
usually referred to as a background field, that we denote $\mathbf{x}^b$. This background field
is assumed to have unbiased, Gaussian errors with known covariance matrix $\mathbf{B}$. In
practice, the background field is usually a short-term forecast of the state from a previous
assimilation cycle. The problem of four-dimensional variational data assimilation
(4DVar)¹ is then to find the initial state that minimizes the sum of the weighted least
squares distance to this background and the weighted least squares distance of
the model trajectory to the observations over the time interval $[t_0, t_N]$. Mathematically,
we can formulate this as an optimization problem: Find the state $\mathbf{x}_0^a$ at time $t_0$ that
minimizes the function

$$\mathcal{J}(\mathbf{x}_0) = \frac{1}{2} (\mathbf{x}_0 - \mathbf{x}^b)^T \mathbf{B}^{-1} (\mathbf{x}_0 - \mathbf{x}^b) + \frac{1}{2} \sum_{i=0}^{N} (\mathcal{H}_i(\mathbf{x}_i) - \mathbf{y}_i)^T \mathbf{R}_i^{-1} (\mathcal{H}_i(\mathbf{x}_i) - \mathbf{y}_i) \tag{2.3}$$

subject to the states $\mathbf{x}_i$ satisfying the nonlinear dynamical system (2.1). In the case
where $N = 0$ there is no model evolution and the scheme is referred to as three-dimensional
variational data assimilation (3DVar). The solution $\mathbf{x}_0^a$ is commonly referred
to as the analysis. In environmental data assimilation, the function $\mathcal{J}(\mathbf{x}_0)$ is
usually called the cost function, but the terms objective function and penalty function
are often used in other fields.
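
To make the structure of (2.3) concrete, the following minimal sketch evaluates the cost function for a scalar state. The toy model, the direct observation operator and all names are assumptions made for illustration; operational systems work with very large state vectors and full covariance matrices.

```python
DT = 0.1


def model_step(x):
    """Toy nonlinear model M_i (assumption): one Euler step of dx/dt = x (1 - x)."""
    return x + DT * x * (1.0 - x)


def obs_operator(x):
    """Toy observation operator H_i (assumption): the state is observed directly."""
    return x


def cost_4dvar(x0, xb, b_var, obs, r_var):
    """Evaluate (2.3) for a scalar state; obs[i] is y_i at time t_i, i = 0..N."""
    background_term = 0.5 * (x0 - xb) ** 2 / b_var
    observation_term = 0.0
    x = x0
    for i, y in enumerate(obs):
        if i > 0:
            x = model_step(x)  # strong constraint: x_i = M_{i-1}(x_{i-1})
        misfit = obs_operator(x) - y
        observation_term += 0.5 * misfit ** 2 / r_var
    return background_term + observation_term


# 3DVar is the special case N = 0, that is, a single observation time.
j_3dvar = cost_4dvar(x0=0.5, xb=0.0, b_var=1.0, obs=[1.0], r_var=1.0)
```

Note that the model constraint (2.1) is enforced simply by generating each state from the previous one, so the cost depends on the initial state only.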

The minimization problem given by equation (2.3) can be interpreted in a statistical
or deterministic sense. From Bayes’ theorem, it can be shown that $\mathbf{x}_0^a$ gives the
maximum a posteriori estimate of the state under the assumptions given [82]. This includes
the assumption of Gaussianity of the error statistics for the background field
and observations. In practice, this assumption may not always hold. For example, for
variables that are inherently nonnegative, such as humidity in the atmosphere or concentrations
in chemical models, Gaussian statistics may not be appropriate. In some
cases these errors may be treated by assuming a lognormal distribution and using
this to transform to variables whose statistics are Gaussian [13, 41]. Some allowance
for non-Gaussian observation errors may also be made using the method of variational
quality control, as discussed in Section 3.3. Furthermore, nonlinearity in the
dynamical model implies that the background errors are likely to be non-Gaussian if
the background comes from a forecast whose length is beyond the linearity regime of
the model. For this reason, in numerical weather prediction the background field is
usually from a forecast of only 6 or 12 hours. In some applications, such as the identification
of the source of an atmospheric tracer, it may be more appropriate to specify
other prior error distributions [12]. The alternative, deterministic interpretation of the
minimization problem is to consider the term measuring the fit to the background
state as a form of Tikhonov regularization in fitting the observations [29, 65, 90]. Each
of these interpretations is able to provide different insights into the practical formulation
of the problem.

¹ The scheme is referred to as four-dimensional since we usually fit three spatial dimensions in time,
with time being the fourth dimension.

It is instructive to consider the solution to the 3DVar problem under the hypothesis
that the observation operator $\mathcal{H}_0$ is approximately linear, that is,

$$\mathcal{H}_0(\mathbf{x}^b) - \mathcal{H}_0(\mathbf{x}_0) \approx \mathbf{H}_0(\mathbf{x}^b)\,(\mathbf{x}^b - \mathbf{x}_0), \tag{2.4}$$

where $\mathbf{H}_0(\mathbf{x}^b)$ is the Jacobian of $\mathcal{H}_0$ evaluated at $\mathbf{x}^b$ (the assumption (2.4) is referred
to as the tangent linear hypothesis). In this case, the minimizer of (2.3) can be
written explicitly as

$$\mathbf{x}^a = \mathbf{x}^b + \mathbf{B}\mathbf{H}_0^T \left(\mathbf{H}_0 \mathbf{B} \mathbf{H}_0^T + \mathbf{R}_0\right)^{-1} \left(\mathbf{y}_0 - \mathcal{H}_0(\mathbf{x}^b)\right). \tag{2.5}$$

This solution is equal to the best linear unbiased estimate (or BLUE). We see then
that the analysis increment, defined as the difference between the analysis and the
background, $\mathbf{x}^a - \mathbf{x}^b$, lies in the range space of the background error covariance matrix
$\mathbf{B}$. We return to the implications of this in Section 3.2.

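The covariance of the analysis error in this case is given by

$$\mathbf{A} = \left(\mathbf{B}^{-1} + \mathbf{H}_0^T \mathbf{R}_0^{-1} \mathbf{H}_0\right)^{-1}. \tag{2.6}$$

We find that for both 3DVar and 4DVar, this is equal to the inverse of the Hessian of
the cost function, that is,

$$\mathbf{A} = \left(\nabla^2 \mathcal{J}\right)^{-1}. \tag{2.7}$$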
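
A small worked example may help to see how (2.5) and (2.6) behave. The sketch below assumes a two-variable state with a single linear observation of the first component, so that the term to be inverted in (2.5) is a scalar; the numbers and function names are illustrative assumptions only.

```python
def inv2(m):
    """Inverse of a 2x2 matrix given as nested lists."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def blue_single_obs(xb, B, h, r, y):
    """Analysis (2.5) for one linear observation y = h . x + error, variance r."""
    n = len(xb)
    Bh = [sum(B[i][j] * h[j] for j in range(n)) for i in range(n)]  # B H^T
    s = sum(h[i] * Bh[i] for i in range(n)) + r                     # H B H^T + R
    innovation = y - sum(h[i] * xb[i] for i in range(n))            # y - H x^b
    return [xb[i] + Bh[i] * innovation / s for i in range(n)]


def analysis_covariance(B, h, r):
    """Analysis error covariance (2.6) for a two-variable state."""
    Binv = inv2(B)
    hessian = [[Binv[i][j] + h[i] * h[j] / r for j in range(2)] for i in range(2)]
    return inv2(hessian)


B = [[1.0, 0.5], [0.5, 1.0]]   # correlated background errors
h = [1.0, 0.0]                 # only the first component is observed
xa = blue_single_obs(xb=[0.0, 0.0], B=B, h=h, r=0.25, y=1.0)
A = analysis_covariance(B, h, r=0.25)
```

For these values the analysis increment is (0.8, 0.4): the unobserved second component is updated only through the off-diagonal entry of B, which is the range-space property noted above. The analysis covariance comes out as [[0.2, 0.1], [0.1, 0.8]], so the variance of the observed component is reduced far more than that of the unobserved one.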


In general, an exact solution cannot be found and the cost function is minimized
using iterative numerical methods, such as conjugate gradient or quasi-Newton methods.
The use of these methods in data assimilation is discussed in more detail in Section 3.4.
On each iteration of such methods, the value of the cost function and its
gradient at the current iterate must be calculated. In order to calculate the gradient
of (2.3) with respect to the initial state $\mathbf{x}_0$, we consider the discrete Euler-Lagrange
equations. We introduce Lagrange multipliers $\boldsymbol{\lambda}_i$ at time $t_i$ and define the Lagrangian
by

$$\mathcal{L}(\mathbf{x}_i, \boldsymbol{\lambda}_i) = \mathcal{J}(\mathbf{x}_0) + \sum_{i=0}^{N-1} \boldsymbol{\lambda}_{i+1}^T \left(\mathbf{x}_{i+1} - \mathcal{M}_i(\mathbf{x}_i)\right). \tag{2.8}$$

Then, necessary conditions for a minimum of (2.3) subject to the constraint are found
by taking variations of $\mathcal{L}$ with respect to $\boldsymbol{\lambda}_i$ and $\mathbf{x}_i$. The first of these leads to the
original nonlinear model equations (2.1), while the latter gives the discrete adjoint
equations

$$\boldsymbol{\lambda}_i = \mathbf{M}_i^T \boldsymbol{\lambda}_{i+1} - \mathbf{H}_i^T \mathbf{R}_i^{-1} \left(\mathcal{H}_i(\mathbf{x}_i) - \mathbf{y}_i\right) \tag{2.9}$$

for $i = 1, \ldots, N$ with boundary condition $\boldsymbol{\lambda}_{N+1} = 0$, where $\mathbf{H}_i$ and $\mathbf{M}_i$ are the Jacobians
of the nonlinear operators $\mathcal{H}_i$ and $\mathcal{M}_i$ with respect to the state variable $\mathbf{x}_i$. In
the data assimilation literature, these Jacobians are referred to as the tangent linear
operator and the tangent linear model (TLM), and the operators $\mathbf{H}_i^T$ and $\mathbf{M}_i^T$ are the
adjoints of the observation operator and the nonlinear model operator. From (2.8) we
then have that the gradient of the Lagrangian with respect to the initial state $\mathbf{x}_0$ is
given by

$$\frac{\partial \mathcal{L}}{\partial \mathbf{x}_0} = -\mathbf{M}_0^T \boldsymbol{\lambda}_1 + \mathbf{H}_0^T \mathbf{R}_0^{-1} \left(\mathcal{H}_0(\mathbf{x}_0) - \mathbf{y}_0\right) + \mathbf{B}^{-1} (\mathbf{x}_0 - \mathbf{x}^b). \tag{2.10}$$

From the theory of Lagrange multipliers, this is equal to the gradient of the function
under the constraint, and thus we can write

$$\nabla \mathcal{J}(\mathbf{x}_0) = -\boldsymbol{\lambda}_0 + \mathbf{B}^{-1} (\mathbf{x}_0 - \mathbf{x}^b), \tag{2.11}$$

where we have introduced the extra variable

$$\boldsymbol{\lambda}_0 = \mathbf{M}_0^T \boldsymbol{\lambda}_1 - \mathbf{H}_0^T \mathbf{R}_0^{-1} \left(\mathcal{H}_0(\mathbf{x}_0) - \mathbf{y}_0\right), \tag{2.12}$$

which can be calculated from the adjoint equations (2.9) with $i = 0$. Hence, the adjoint
equations provide an efficient method for calculating the gradient information
needed for the minimization algorithm. Each iteration of a numerical optimization
method therefore requires one run of the forward model (2.1) to calculate the value of
the cost function and one run of the adjoint model (2.9) to calculate the gradient. This
makes 4DVar very expensive from a computational point of view.
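
The forward and backward sweeps described above can be sketched for a scalar state, where the adjoint of each Jacobian is simply the Jacobian itself. The toy model, the direct observation operator (so that H_i = 1), the observation values and all names are assumptions for illustration; the finite-difference routine is included only as an independent check of the adjoint gradient.

```python
DT = 0.1


def model_step(x):
    """Toy nonlinear model (assumption): one Euler step of dx/dt = x (1 - x)."""
    return x + DT * x * (1.0 - x)


def tangent_linear(x):
    """Jacobian M_i of model_step, evaluated at x."""
    return 1.0 + DT * (1.0 - 2.0 * x)


def cost(x0, xb, b_var, obs, r_var):
    """Scalar version of the cost function (2.3) with H_i = 1."""
    total = 0.5 * (x0 - xb) ** 2 / b_var
    x = x0
    for i, y in enumerate(obs):
        if i > 0:
            x = model_step(x)
        total += 0.5 * (x - y) ** 2 / r_var
    return total


def gradient_by_adjoint(x0, xb, b_var, obs, r_var):
    """Gradient of the cost via one forward and one backward sweep."""
    # Forward run of (2.1), storing the trajectory x_0, ..., x_N.
    trajectory = [x0]
    for _ in range(len(obs) - 1):
        trajectory.append(model_step(trajectory[-1]))
    # Backward run of (2.9), starting from lambda_{N+1} = 0.
    lam = 0.0
    for x, y in zip(reversed(trajectory), reversed(obs)):
        lam = tangent_linear(x) * lam - (x - y) / r_var
    # lam now holds lambda_0 of (2.12); (2.11) gives the gradient.
    return -lam + (x0 - xb) / b_var


def gradient_by_differences(x0, xb, b_var, obs, r_var, eps=1e-6):
    """Central finite differences, used only as an independent check."""
    upper = cost(x0 + eps, xb, b_var, obs, r_var)
    lower = cost(x0 - eps, xb, b_var, obs, r_var)
    return (upper - lower) / (2.0 * eps)


obs = [0.21, 0.20, 0.23, 0.22, 0.25, 0.24]   # made-up observations
g_adjoint = gradient_by_adjoint(0.3, 0.25, 0.04, obs, 0.01)
g_check = gradient_by_differences(0.3, 0.25, 0.04, obs, 0.01)
```

The adjoint sweep costs roughly one additional model run regardless of the dimension of the state, whereas finite differences would need one extra cost evaluation per state variable, which is why the adjoint approach is the only feasible one for large systems.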

We note that in this derivation, we have implicitly taken the adjoint with respect
to the Euclidean inner product. For a general linear operator $L : X_1 \to X_2$ and inner
products $\langle \cdot, \cdot \rangle_{X_1}$, $\langle \cdot, \cdot \rangle_{X_2}$ in the spaces $X_1$, $X_2$ respectively, the adjoint of $L$ is the
operator $L^* : X_2 \to X_1$ such that

$$\langle L x_1, x_2 \rangle_{X_2} = \langle x_1, L^* x_2 \rangle_{X_1} \tag{2.13}$$

for all $x_1 \in X_1$, $x_2 \in X_2$. In the case where the Euclidean inner product is used
in both spaces, the adjoint is equal to the transpose operator, which is why we define
the transpose matrices $\mathbf{H}_i^T$ and $\mathbf{M}_i^T$ as the adjoint operators. In this case, the Lagrange
multipliers provide the correct gradient of the cost function with respect to the state
vector, but it is difficult to interpret physically what these variables mean. For other
applications of adjoint modeling, for example, generating initial perturbations for
ensembles of forecasts, it may be desirable to give a physical interpretation to the
gradients calculated from the Lagrange multipliers. In these applications, other inner
products may be used, for example, based on the energy or enstrophy² of the
system [95].
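
As a short worked illustration of how the adjoint changes with the inner product, suppose (as an assumption, with notation introduced only for this example) that the inner products are defined by symmetric positive definite weight matrices $W_1$ and $W_2$. The adjoint then follows directly from the definition (2.13):

```latex
% Assumption: W_1 and W_2 are symmetric positive definite weight matrices
% (hypothetical notation) defining <u, v>_{X_k} = u^T W_k v for k = 1, 2.
\begin{aligned}
\langle L x_1, x_2 \rangle_{X_2}
  &= (L x_1)^T W_2 x_2
   = x_1^T L^T W_2 x_2 \\
  &= x_1^T W_1 \left( W_1^{-1} L^T W_2 x_2 \right)
   = \langle x_1, W_1^{-1} L^T W_2 x_2 \rangle_{X_1},
\end{aligned}
\qquad \text{so} \qquad L^* = W_1^{-1} L^T W_2 .
```

With $W_1 = W_2 = I$ this reduces to the transpose, which is the Euclidean case used in the derivation above.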


2.1 Incremental variational data assimilation 


The possibility of implementing variational data assimilation in an operational setting
came with the proposal of incremental variational data assimilation [27]. In this
formulation, the solution to the nonlinear minimization problem (2.3) is approximated
by a sequence of minimizations of linear quadratic cost functions. We define $\mathbf{x}_0^{(k)}$
to be the $k$th estimate of the solution and linearize the cost function (2.3) around the
model trajectory forecast from this estimate. The next estimate is then defined by

$$\mathbf{x}_0^{(k+1)} = \mathbf{x}_0^{(k)} + \delta\mathbf{x}_0^{(k)}, \tag{2.14}$$

where the perturbation $\delta\mathbf{x}_0^{(k)} \in \mathbb{R}^n$ minimizes the linearized cost function

$$\tilde{\mathcal{J}}^{(k)}(\delta\mathbf{x}_0^{(k)}) = \frac{1}{2} \left(\delta\mathbf{x}_0^{(k)} - [\mathbf{x}^b - \mathbf{x}_0^{(k)}]\right)^T \mathbf{B}^{-1} \left(\delta\mathbf{x}_0^{(k)} - [\mathbf{x}^b - \mathbf{x}_0^{(k)}]\right) + \frac{1}{2} \sum_{i=0}^{N} \left(\mathbf{H}_i \delta\mathbf{x}_i^{(k)} - \mathbf{d}_i^{(k)}\right)^T \mathbf{R}_i^{-1} \left(\mathbf{H}_i \delta\mathbf{x}_i^{(k)} - \mathbf{d}_i^{(k)}\right). \tag{2.15}$$

Here, $\mathbf{d}_i^{(k)} = \mathbf{y}_i - \mathcal{H}_i(\mathbf{x}_i^{(k)})$, where $\mathbf{x}_i^{(k)}$ is the nonlinear trajectory calculated from
the current estimate at the initial time using the nonlinear model equation (2.1). The
perturbation $\delta\mathbf{x}_i^{(k)}$ satisfies the linear dynamical equation

$$\delta\mathbf{x}_{i+1}^{(k)} = \mathbf{M}_i \delta\mathbf{x}_i^{(k)}. \tag{2.16}$$

The linearized observation operator $\mathbf{H}_i$ and the tangent linear model operator $\mathbf{M}_i$ are
evaluated at the current estimate of the nonlinear trajectory, usually called the linearization
state. The minimization (2.15) is referred to as the inner loop, while the update
of the nonlinear model trajectory $\mathbf{x}^{(k)}$ is the outer loop. On each iteration of the
inner loop, the TLM is integrated to calculate the evolution of the perturbation in order
to calculate the cost function (2.15), and the adjoint model is integrated to provide
the gradient.

² In fluid dynamics, the enstrophy is defined as the mean square vorticity of the fluid [58, Section 13.4].
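
The interplay of outer and inner loops can be sketched for a scalar state. In the following illustration (toy model, direct observations with H_i = 1, made-up data and hypothetical names, all assumptions), the inner problem (2.15) is a scalar quadratic and is therefore minimized in closed form rather than by an iterative method as in practice.

```python
DT = 0.1


def model_step(x):
    """Toy nonlinear model (assumption): one Euler step of dx/dt = x (1 - x)."""
    return x + DT * x * (1.0 - x)


def tangent_linear(x):
    """Jacobian M_i of model_step, evaluated at x."""
    return 1.0 + DT * (1.0 - 2.0 * x)


def incremental_4dvar(xb, b_var, obs, r_var, n_outer=8):
    """Scalar incremental 4DVar with H_i = 1; returns the iterates x_0^(k)."""
    x0 = xb                      # first guess: the background
    iterates = [x0]
    for _ in range(n_outer):
        # Outer loop: nonlinear trajectory, innovations d_i and the products
        # c_i = M_{i-1} ... M_0 that map delta x_0 to delta x_i via (2.16).
        x, c = x0, 1.0
        d, sens = [], []
        for i, y in enumerate(obs):
            if i > 0:
                c *= tangent_linear(x)   # Jacobian at x_{i-1}
                x = model_step(x)        # x_i = M_{i-1}(x_{i-1})
            d.append(y - x)
            sens.append(c)
        # Inner loop: (2.15) is quadratic in delta x_0, so setting its
        # derivative to zero gives the minimizer directly.
        rhs = (xb - x0) / b_var + sum(ci * di for ci, di in zip(sens, d)) / r_var
        hess = 1.0 / b_var + sum(ci * ci for ci in sens) / r_var
        x0 = x0 + rhs / hess     # update (2.14)
        iterates.append(x0)
    return iterates


obs = [0.21, 0.21, 0.24, 0.25, 0.26, 0.30]   # made-up observations
iterates = incremental_4dvar(xb=0.3, b_var=0.04, obs=obs, r_var=0.01)
```

For a mildly nonlinear problem such as this one, the increments shrink rapidly from one outer loop to the next, which is consistent with the small number of outer loops used in practice.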

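The incremental method was later shown to be equivalent to an inexact Gauss-Newton
method applied to the original nonlinear cost function (2.3) [72]. If we consider
a general nonlinear least squares cost function

$$\phi(\mathbf{x}) = \frac{1}{2} \|\mathbf{f}(\mathbf{x})\|_2^2 \tag{2.17}$$

with $\mathbf{f} : \mathbb{R}^n \to \mathbb{R}^p$ and let $\mathbf{J}(\mathbf{x})$ be the Jacobian of $\mathbf{f}(\mathbf{x})$ with respect to $\mathbf{x}$, then the
Gauss-Newton method for minimizing $\phi$ is

Algorithm 2.1 (Gauss-Newton).

step 0: choose $\mathbf{x}^{(0)}$
step 1: repeat until convergence
step 1.1: compute $\delta\mathbf{x}^{(k)} = -\left(\mathbf{J}(\mathbf{x}^{(k)})^T \mathbf{J}(\mathbf{x}^{(k)})\right)^{-1} \mathbf{J}(\mathbf{x}^{(k)})^T \mathbf{f}(\mathbf{x}^{(k)})$
step 1.2: update $\mathbf{x}^{(k+1)} = \mathbf{x}^{(k)} + \delta\mathbf{x}^{(k)}$.

Sufficient conditions can be found such that the algorithm will converge to a local
minimum of (2.17) [34]. Step 1.1 of the algorithm is equivalent to solving the minimization
problem

$$\min_{\delta\mathbf{x}^{(k)}} \frac{1}{2} \left\| \mathbf{J}(\mathbf{x}^{(k)}) \delta\mathbf{x}^{(k)} + \mathbf{f}(\mathbf{x}^{(k)}) \right\|_2^2. \tag{2.18}$$

If we define

$$\mathbf{f}(\mathbf{x}_0) = \begin{pmatrix} \mathbf{B}^{-1/2} (\mathbf{x}_0 - \mathbf{x}^b) \\ \mathbf{R}_0^{-1/2} (\mathcal{H}_0[\mathbf{x}_0] - \mathbf{y}_0) \\ \vdots \\ \mathbf{R}_N^{-1/2} (\mathcal{H}_N[\mathbf{x}_N] - \mathbf{y}_N) \end{pmatrix} \tag{2.19}$$

subject to (2.1), then the general cost function (2.17) is equal to the 4DVar cost function
(2.3). Applying the Gauss-Newton method to solve this problem, we find that the
inner minimization step (2.18) is equivalent to the linearized cost function (2.15).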
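
Algorithm 2.1 can be sketched for a small curve-fitting problem with two unknowns, where the normal equations of step 1.1 are a 2x2 system that can be solved directly. The test problem (fitting a and b in a exp(b t) to noise-free synthetic data), the starting point and all names are assumptions made for illustration; no safeguard such as a line search is included.

```python
import math


def gauss_newton(residual, jacobian, x, tol=1e-12, max_iter=50):
    """Algorithm 2.1 for two unknowns; solves the 2x2 normal equations directly."""
    for _ in range(max_iter):
        f = residual(x)
        J = jacobian(x)
        # Step 1.1: normal equations (J^T J) dx = -J^T f.
        a11 = sum(row[0] * row[0] for row in J)
        a12 = sum(row[0] * row[1] for row in J)
        a22 = sum(row[1] * row[1] for row in J)
        g1 = sum(row[0] * fi for row, fi in zip(J, f))
        g2 = sum(row[1] * fi for row, fi in zip(J, f))
        det = a11 * a22 - a12 * a12
        dx = [(-g1 * a22 + g2 * a12) / det, (-a11 * g2 + a12 * g1) / det]
        # Step 1.2: update the estimate.
        x = [x[0] + dx[0], x[1] + dx[1]]
        if math.hypot(dx[0], dx[1]) < tol:
            break
    return x


T = [0.0, 1.0, 2.0, 3.0]
Y = [2.0 * math.exp(-0.5 * t) for t in T]   # synthetic, noise-free data


def residual(x):
    """f(x): misfit between the model a * exp(b t) and the data."""
    a, b = x
    return [a * math.exp(b * t) - y for t, y in zip(T, Y)]


def jacobian(x):
    """Rows of J(x): derivatives of each residual with respect to (a, b)."""
    a, b = x
    return [[math.exp(b * t), a * t * math.exp(b * t)] for t in T]


a_fit, b_fit = gauss_newton(residual, jacobian, [1.5, -0.3])
```

Because the data are noise-free, this is a zero-residual problem, for which Gauss-Newton behaves like Newton's method near the solution. In 4DVar the system in step 1.1 is never formed explicitly; the equivalent quadratic problem (2.18) is instead minimized iteratively.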

An advantage of using this method to solve the nonlinear problem is that each
inner minimization problem is quadratic in $\delta\mathbf{x}$. Hence, whereas the nonlinear problem
may have multiple minima, the inner problem has a unique solution that can
be found efficiently using iterative minimization methods (we discuss these methods 
further in Section 3.4). Since these minimization methods are usually truncated ac- 
cording to some stopping criterion, the inner step of the Gauss-Newton method is 
not solved exactly. In this case, the outer loop iterations can be shown to be local- 
ly convergent under certain conditions, provided that the inner loop minimization is 
solved to sufficient accuracy [45, 71]. 

In practice, very few outer loop steps are performed. For example, the Met Office
in the U.K. performs only one, while the European Centre for Medium-Range Weather
Forecasts (ECMWF) performs three [39, 100]. As for the fully nonlinear problem, the
incremental method can be run as 3DVar (no model evolution) or 4DVar (including
the model evolution). An alternative formulation that is often implemented is known
as 3D-FGAT (First Guess at Appropriate Time). This includes the nonlinear model evolution
in the calculation of the vectors $\mathbf{d}_i$, but no evolution is included for the perturbation
and the TLM operator $\mathbf{M}_i$ in equation (2.16) is replaced by the identity. This
ensures that the observations are compared with the nonlinear trajectory at the correct
time, but approximates the perturbation in such a way that no TLM or adjoint
model is needed. In this way, some of the benefits of 4DVar can be achieved without
too much extra computational cost [70, 87].

A major advantage of the incremental approach is that the inner loop minimiza- 
tion problem may be solved in a smaller dimensional space than the outer loop step, 
for example, at a lower spatial resolution. In this way, the TLM and adjoint model 
need only be run at the lower resolution on each inner loop iteration, while the lin- 
earization trajectory from the nonlinear model is still calculated at the higher resolu- 
tion on each outer loop. This is discussed further in Section 3.5. The computational 
savings made by implementing the inner loop in this way made incremental 4DVar 
feasible for operational weather and ocean forecasting. 

Having presented the basic theory of variational data assimilation, we now examine
some of the issues that arise in its practical implementation. For the very large
systems found in environmental modeling, it is not always possible to apply the theory
in an intuitive way. Many choices must be made in order to set up and solve the
assimilation problem efficiently, and compromises must often be made. It is the attention
to detail in these choices that can determine the success or otherwise of the data
assimilation scheme.


3 Practical implementation 


3.1 Model development 


The development of a 4DVar scheme for the large models used in operational weather
and ocean forecasting is a huge undertaking. In most cases, the nonlinear model code
already exists and has been developed over many years. These models are very large
pieces of software, with perhaps close to one million lines of code. In order to develop
an incremental 4DVar scheme, the code for the TLM and adjoint model must first be
written. The development of a TLM code and adjoint model code from the source code
of a nonlinear model is a fairly automatic procedure. The correct code for the TLM can
be found from a linearization of each statement of the nonlinear model source code
based on treating the nonlinear model as a series of arithmetic operations and applying
the chain rule. The adjoint model is then found by a line-by-line transpose of
the TLM source code in reverse order. This method is known as automatic differentiation.
We do not go into details of its application here, but refer the reader to several
good introductions in the literature [10, 26, 44, 102]. The automatic nature of this procedure
has led to many software tools being developed that will produce a TLM and
adjoint model code from a nonlinear model source code. These automatic differentiation
tools, or automatic adjoint compilers, are now available commercially for many
different programming languages.³
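
The line-by-line procedure can be illustrated on a routine with just two statements. The example below is a hand-written sketch with hypothetical names, not the output of any particular tool: each statement of the nonlinear code is differentiated to give the TLM, and each TLM statement is then transposed, in reverse order, to give the adjoint.

```python
import math


def nonlinear(x, y):
    """Nonlinear code: two statements."""
    u = x * y
    z = math.sin(u) + x
    return z


def tangent_linear(x, y, dx, dy):
    """Each statement of nonlinear() differentiated in turn (chain rule)."""
    u = x * y                      # the linearization state is recomputed
    du = dx * y + x * dy
    dz = math.cos(u) * du + dx
    return dz


def adjoint(x, y, z_ad):
    """Transpose of each TLM statement, taken in reverse order."""
    u = x * y
    x_ad = y_ad = u_ad = 0.0
    # Adjoint of: dz = cos(u) * du + dx
    u_ad += math.cos(u) * z_ad
    x_ad += z_ad
    # Adjoint of: du = dx * y + x * dy
    x_ad += y * u_ad
    y_ad += x * u_ad
    return x_ad, y_ad
```

Note that the adjoint variables are accumulated with `+=`, since an input that appears in several TLM statements (here `dx`) receives a contribution from each of them.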

In practice, the TLM and adjoint models of many large environmental models 
have been developed by hand, rather than using the automatic compilers. There are 
several reasons for this. The first is that in many cases of operational weather and 
ocean forecasting, the complexity of the already existing nonlinear model codes was 
such that simple application of the automatic compilers was not possible. In many 
cases, particularly for large codes developed by many people, it is necessary to tidy 
the nonlinear model codes to make them suitable for use with the automatic compil- 
ers. Many centers felt that the effort to do this would have been greater than coding 
the TLM and adjoint model by hand. 

The second reason for developing the TLM and adjoint codes by hand arises 
from the nature of the incremental approach to variational data assimilation. Since 
the TLM and adjoint are run at a lower resolution in the inner loop, the TLM is al- 
ready an approximate linearization of the nonlinear model used in the outer loop. It 
is therefore justifiable to make further simplifications in the TLM in order to reduce 
the computational cost. As long as the adjoint model is derived from the approximate 
TLM, then the inner loop minimization will contain the correct gradient information 
for convergence. In coding the models by hand, it is easier to make such simplifica- 
tions based on physical arguments. For example, many meteorological models con- 
tain parametrizations of subgrid-scale processes (known as the physics in the meteo- 
rological literature), including such things as clouds, precipitation and surface drag. 
The schemes used to represent these processes can be highly complex and often in- 
clude nondifferentiable functions, such as on-off switches. While it is possible for 
automatic differentiation to deal with such functions, it is usually felt that this level 
of complexity is not necessary in the TLM and adjoint model. Hence, a series of sim- 
pler parametrizations have been developed solely for use in incremental 4DVar that 
capture the main behavior of the more complex schemes [64, 88, 99, 118]. 

An alternative approach, devised by the Met Office, is to start from the premise
that the linear model must evolve finite and not infinitesimal perturbations so that
there is no need for the linear model to be tangent to any nonlinear model. In this
approach, the linear model is designed with this in mind. In particular, the resolved
dynamics is approximated by a discretization of the linearized continuous equations,
with various simplifications in the equations and the discretization. Then simplified
parametrizations can be used to represent subgrid-scale processes [74, 86]. The adjoint
model is derived from this approximate linear model by the process of automatic
differentiation, ensuring that it provides the exact gradient of the discrete linear cost
function.

³ The term automatic differentiation refers to the approach itself, not just to the automatic tools.

An essential part of the development of the linear and adjoint models is their 
testing, as any small mistakes could lead to lack of convergence of the minimization 
algorithms. Robust tests exist to check the coding of a TLM and adjoint model. The test 
for the TLM is based on comparing the evolution of a perturbation in the TLM with the 
evolution of the same perturbation in the nonlinear model. A Taylor series expansion 
of the nonlinear model operator shows that the evolutions should be closer together 
as the perturbation size is reduced [79, 98]. When an inexact TLM is used, the test 
is not able to differentiate between small coding errors and the desired inexactness. 
In this case, other more subjective tests must be performed [74]. The adjoint model 
code can be tested by a verification of the adjoint identity (2.13). If we assume that the
spaces $X_1$ and $X_2$ are both equal to $\mathbb{R}^n$, then we must have

\[ \langle M_i \delta x_i, M_i \delta x_i \rangle = \langle \delta x_i, M_i^{*} (M_i \delta x_i) \rangle , \tag{2.20} \]

which, in the Euclidean inner product, is equivalent to

\[ (M_i \delta x_i)^T (M_i \delta x_i) = \delta x_i^T \left( M_i^T (M_i \delta x_i) \right) . \tag{2.21} \]

This identity can be tested for random perturbations $\delta x_i$. If the adjoint operator $M_i^T$
has been correctly coded, then this identity will hold to machine precision [93]. For
large codes, each of these tests should be available for each subroutine, as well as at 
higher levels. A further test, also based on a Taylor expansion, is used to verify that 
the gradient of the cost function has been correctly coded [93]. 
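
The two tests can be sketched on a toy problem. The following is a minimal illustration in Python with NumPy, not code from any operational system: the one-step nonlinear model, its time step `DT` and the function names are all assumptions made for this example. The Taylor test checks that the relative mismatch between the nonlinear and linear evolutions shrinks with the perturbation size, and the adjoint test evaluates both sides of (2.21).

```python
import numpy as np

DT = 0.01  # time step of the toy model (an assumption of this sketch)

def model(x):
    """One step of a toy nonlinear model with a quadratic term."""
    return x + DT * (x * np.roll(x, 1) - x)

def tlm(x, dx):
    """Tangent linear model: Jacobian of `model` at x applied to dx."""
    return dx + DT * (dx * np.roll(x, 1) + x * np.roll(dx, 1) - dx)

def adjoint(x, lam):
    """Adjoint model: transpose of the Jacobian of `model` at x applied to lam."""
    return lam + DT * (lam * np.roll(x, 1) + np.roll(x * lam, -1) - lam)

def taylor_test(x, dx, alphas=(1e-1, 1e-2, 1e-3)):
    """Relative mismatch between nonlinear and linear evolution of alpha * dx."""
    errors = []
    for a in alphas:
        nonlinear = model(x + a * dx) - model(x)
        linear = tlm(x, a * dx)
        errors.append(np.linalg.norm(nonlinear - linear) / np.linalg.norm(linear))
    return errors

def adjoint_test(x, dx):
    """Return both sides of identity (2.21) for the perturbation dx."""
    m_dx = tlm(x, dx)
    return m_dx @ m_dx, dx @ adjoint(x, m_dx)

rng = np.random.default_rng(0)
x, dx = rng.standard_normal(8), rng.standard_normal(8)
print(taylor_test(x, dx))   # mismatch should shrink roughly in proportion to alpha
print(adjoint_test(x, dx))  # the two numbers should agree to rounding error
```

In a large code the same pair of checks would be applied subroutine by subroutine, with `model`, `tlm` and `adjoint` replaced by the routine under test and its linearizations.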


3.2 Background error covariances 


The background field $x^b$ is a very important part of practical data assimilation systems
in environmental forecasting. Since in many operational forecasting systems
the background field is a forecast from a previous assimilation cycle, it contains infor- 
mation from observations assimilated at earlier times. In one of the early 4DVar sys- 
tems at ECMWF, it was shown that at any assimilation time, the background field has 
an approximately 85% influence on the analysis, with the new observations contribut- 
ing only 15% [24]. The background error covariance matrix B determines the relative 
weight between the background field and observations, and hence plays an essential 
role in the data assimilation algorithm. However, the calculation of these covariances 
for the assimilation system is a hugely complex task and very dependent on the specific
system being modeled. Here, we are only able to give an outline of the main steps
involved. For further details in the context of atmospheric data assimilation, the reader
is referred to the comprehensive two-part review article of Bannister [6, 7].

As was seen from (2.5) in Section 2, under certain simplified assumptions the
analysis increment of 3DVar can be shown to lie in the subspace spanned by the
columns of the matrix $B$. In order to understand the implications of this, we consider
the case where we have a single observation $y$ of the $k$th component of the vector $x$,
with error variance $\sigma_o^2$. In this case, the observation operator is linear and is given by
$e_k^T$, the transpose of the $k$th unit vector, and the analysis equation (2.5) becomes

\[ x^a = x^b + \begin{pmatrix} b_{1,k} \\ b_{2,k} \\ \vdots \\ b_{n,k} \end{pmatrix} \frac{y - x^b(k)}{b_{k,k} + \sigma_o^2} , \tag{2.22} \]

where $b_{i,k}$, $i = 1, \ldots, n$, indicates the $(i,k)$ element of the matrix $B$ and $x^b(k)$ is
the $k$th component of $x^b$. Hence, we see that the value of each entry $b_{i,k}$, which is
the covariance between the errors in the components of the background field $x^b(i)$
and $x^b(k)$, determines the analysis increment to the $i$th component of the state given
an observation of the $k$th component. As a consequence, the entries of this matrix
determine how observations are used to infer information about unobserved parts of the
state. Thus, this matrix is fundamental in allowing information to be inferred about
unobserved physical variables or unobserved regions of space. However, it is usually
impossible to store this matrix explicitly. If the state vector is of size $n$,
then the matrix $B$ is of size $n \times n$ and when $n$ is of order $10^8$, this matrix is impossible
to calculate or store. Instead, the action of this matrix is usually represented by
a variable transform.
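
The spreading of a single observation by the columns of $B$ can be illustrated with a few lines of Python. This is a sketch only: the function name `single_obs_analysis`, the Gaussian-shaped covariance and the numerical values are assumptions chosen for the example, and component indices are 0-based as usual in NumPy.

```python
import numpy as np

def single_obs_analysis(xb, B, k, y, sigma_o2):
    """Analysis (2.22) for one observation y of component k of the state."""
    gain = B[:, k] / (B[k, k] + sigma_o2)   # column k of B, scaled
    return xb + gain * (y - xb[k])

n = 5
idx = np.arange(n)
# Hypothetical background error covariance with correlations decaying with distance.
B = np.exp(-0.5 * (idx[:, None] - idx[None, :]) ** 2)
xb = np.zeros(n)

xa = single_obs_analysis(xb, B, k=2, y=1.0, sigma_o2=0.5)
print(xa)  # the increment has the shape of column 2 of B
```

Only component 2 is observed, yet every component whose background error is correlated with it receives an increment.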

We consider the variable transform in the context of incremental variational data
assimilation since that is how it is usually implemented. We define a new variable
$\delta z_i \in \mathbb{R}^n$ and a transformation matrix $U_i \in \mathbb{R}^{n \times n}$ such that

\[ \delta x_i = U_i \delta z_i , \quad i = 0, \ldots, N . \tag{2.23} \]

In terms of this new variable, the incremental cost function (2.15) can be written as

\[ \begin{aligned} \tilde{J}^{(k)}[\delta z_0^{(k)}] = {} & \tfrac{1}{2} \left( \delta z_0^{(k)} - [z^b - z_0^{(k)}] \right)^T U_0^T B^{-1} U_0 \left( \delta z_0^{(k)} - [z^b - z_0^{(k)}] \right) \\ & + \tfrac{1}{2} \sum_{i=0}^{N} \left( H_i U_i \delta z_i^{(k)} - d_i^{(k)} \right)^T R_i^{-1} \left( H_i U_i \delta z_i^{(k)} - d_i^{(k)} \right) . \end{aligned} \tag{2.24} \]

If the variables $\delta z$ are chosen in such a way that their errors are uncorrelated and have unit
variance, then they have the identity covariance matrix by definition and so $U_0^T B^{-1} U_0$ can be replaced
with the identity in the cost function (2.24). In this case, the cost function no longer
contains the original background error covariance matrix; instead, it is implicitly defined
through the variable transform with $B = U_0 U_0^T$.

Furthermore, this variable transform is expected to lead to a better conditioned
problem. To understand this, we note that the Hessian of the original inner loop cost
function (2.15) is given by

\[ G = B^{-1} + \sum_{i=0}^{N} M(t_i, t_0)^T H_i^T R_i^{-1} H_i M(t_i, t_0) , \tag{2.25} \]

where

\[ M(t_i, t_0) = M_{i-1} M_{i-2} \cdots M_0 \tag{2.26} \]

is the tangent linear model solution operator from time $t_0$ to time $t_i$. Equivalently, we
can write this as

\[ G = B^{-1} + \hat{H}^T \hat{R}^{-1} \hat{H} , \tag{2.27} \]

where

\[ \hat{H} = \begin{pmatrix} H_0 \\ H_1 M(t_1, t_0) \\ \vdots \\ H_N M(t_N, t_0) \end{pmatrix} , \tag{2.28} \]

and $\hat{R}$ is a block diagonal matrix with blocks equal to $R_i$, $i = 0, \ldots, N$. If the background
error covariance matrix is ill-conditioned, then we expect this to dominate
the conditioning of the Hessian $G$. We return to an examination of this in Section 3.4.
On the other hand, the Hessian of the transformed problem (2.24) is given by

\[ G = I + \sum_{i=0}^{N} U_0^T M(t_i, t_0)^T H_i^T R_i^{-1} H_i M(t_i, t_0) U_0 . \tag{2.29} \]
Usually, the number of observations is less than the number of state variables being 
estimated and so the Hessian (2.29) is equal to the identity plus a low rank matrix. 
Then, it has a minimum eigenvalue equal to one and the condition number (in the 
two-norm) is equal to the largest eigenvalue. Thus, we would expect the transformed 
problem to be better conditioned. 
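
The effect on the conditioning can be checked numerically on a small example. The sketch below is illustrative only and rests on several assumptions that are not part of the text: a 30-point one-dimensional grid, a Gaussian-shaped background error covariance, five direct observations all at the initial time (so that $M(t_i, t_0) = I$), and a symmetric square root of $B$ as the transform $U_0$.

```python
import numpy as np

n, sigma_o2 = 30, 0.1
idx = np.arange(n)
# Hypothetical background error covariance: Gaussian correlations, unit variance.
B = np.exp(-0.5 * ((idx[:, None] - idx[None, :]) / 1.5) ** 2)

# Five direct observations of well-separated grid points.
obs_points = [2, 8, 14, 20, 26]
H = np.zeros((len(obs_points), n))
H[np.arange(len(obs_points)), obs_points] = 1.0
obs_term = H.T @ H / sigma_o2              # H^T R^{-1} H with R = sigma_o2 * I

# Symmetric square root U of B, so that B = U U^T.
w, V = np.linalg.eigh(B)
U = (V * np.sqrt(w)) @ V.T

G_original = np.linalg.inv(B) + obs_term           # Hessian in the form (2.25)
G_transformed = np.eye(n) + U.T @ obs_term @ U     # Hessian in the form (2.29)

cond_original = np.linalg.cond(G_original)
cond_transformed = np.linalg.cond(G_transformed)
print(cond_original, cond_transformed)
```

In this setting the transformed Hessian is the identity plus a rank-five matrix, so its smallest eigenvalue is one and its condition number equals its largest eigenvalue.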

Of course, this theory all relies on being able to choose appropriate variables $\delta z$
that are truly uncorrelated and it is here that a knowledge of the physical problem
is necessary. In presenting how the transform is designed in practice, it is easier to
think about it in terms of the inverse transform, from model variables $\delta x$ to uncorrelated
variables $\delta z$. A common approach in numerical weather prediction is to split the
inverse transform into two parts. The first part, which we write as $U_p^{-1}$, is known as
the parameter transform and transforms to physical variables $\delta\chi$ whose errors are assumed
to be uncorrelated between themselves, but still contain spatial correlations.
The spatial transform, $U_s^{-1}$, then removes the spatial correlations in the physical
variables $\delta\chi$. We thus have the steps

\[ \delta\chi = U_p^{-1} \delta x , \tag{2.30} \]
\[ \delta z = U_s^{-1} \delta\chi , \tag{2.31} \]

where for ease of notation, we assume the transforms to be time-invariant. In practice,
the transforms $U_p$ and $U_s$ may not be square and a generalization of the inverse
operator is needed. We now consider each of these transforms in turn.


3.2.1 Parameter transform 

In designing a suitable transform of parameters $U_p^{-1}$, it is necessary to have an understanding
of the particular system being modeled in order to decide which variables
have errors that are likely to be uncorrelated. For atmospheric models, the transform 
is based on the concept of balanced variables. Balance relationships are diagnostic re- 
lationships that exist between certain atmospheric variables. For example, in midlat- 
itudes and at large horizontal length scales, the horizontal wind is approximately in 
balance with the gradient of the pressure field through the relationship of geostrophic 
balance. This relationship can be used in the parameter transform by assuming that 
errors in the balanced part of the flow are uncorrelated with those in the unbalanced 
part [7]. This can be justified by an eigenanalysis of the linearized equation set, which 
shows that the balanced flow can be associated with one eigenvector and the unbal- 
anced flow with the remaining eigenvectors. Hence, under linear evolution, these will 
evolve independently. 

The variable that best represents the balanced flow in the atmosphere is potential 
vorticity (PV) [59] and so it would be natural to use this variable as the basis for the 
parameter transform. However, the transform from PV to the original model variables 
requires the solution of a three-dimensional elliptic equation as part of the application
of the operator $U_p$. In dynamical regimes with small characteristic horizontal
length scales, the PV is well-approximated by the vorticity, which only requires the 
solution of a two-dimensional equation [117]. Hence, early work in this area proposed 
a transform based on this variable [97] and this is still the basis of the parameter 
transform in many operational weather forecasting systems [7]. It is recognized that 
this approximation is not valid in all parts of the atmosphere and it has been demon- 
strated on simple systems that significant correlations can remain between errors in 
the transformed variables [66]. For this reason, attempts are being made to implement 
a transformation based on PV in large scale systems [8, 28]. 

A similar approach may be followed in other applications, for example, in ocean 
forecasting, though here there has been less work on the design of appropriate trans- 
forms than in the meteorological context. In many cases, it may be assumed that er- 
rors in the model variables such as salinity and temperature are uncorrelated and 
only the spatial transform is needed [116], but work on defining balance relationships 
has allowed multivariate covariances to be introduced [114].


3.2.2 Spatial transform

Once the parameter transform has been performed, it is assumed that the errors in 
the resulting variables are uncorrelated between themselves. At this point, it is neces- 
sary to specify the autocovariance information for each parameter through the spatial 
transform. In atmospheric models, it is common to assume that the transforms in the 
horizontal and vertical planes are separable. In most systems, a Fourier transform is 
used in the horizontal and for the vertical correlations a transformation to the eigen- 
vectors of a vertical error covariance matrix is used. The order in which these trans- 
formations are performed varies between systems. If the horizontal transform is per- 
formed first, then the horizontal spectral modes are assumed to be uncorrelated and 
vertical correlations are specified separately for each mode. This assumption leads to 
correlations that are homogeneous (independent of horizontal position) and isotrop- 
ic (independent of orientation) in the transformed parameters [7]. The method allows 
vertical correlations that vary with horizontal scale, so that features with large hori- 
zontal scale have deeper vertical correlations [36]. However, it does not allow vertical 
correlations to vary with horizontal position [62]. The alternative is to first perform 
the vertical transform and then, assuming that these modes are independent, apply 
the Fourier transform to each vertical mode. This allows more variation of vertical 
correlations with horizontal position (for example, with latitude). However, it is more 
difficult to obtain an appropriate variation in horizontal correlation length scales with 
height [62, 84]. In both cases, a scaling transformation is also needed to ensure that 
the variance of the transformed variables is equal to one. In an ideal case, we would 
like to obtain covariances that depend on both horizontal scale and horizontal posi- 
tion. This has led to the development of spatial transforms based on a wavelet basis 
[5, 36]. Such a transform has been implemented in the operational NWP system of 
ECMWF. 

In ocean models, the complex boundaries near the coast prohibit the simple use 
of a Fourier transform in the horizontal and so other methods must be used to repre- 
sent spatial correlations. For example, the application of a correlation operator can 
be shown to be equivalent to the integration of an appropriately-constructed diffusion 
equation [113]. This can be used to design correlation models for use in data assimila- 
tion systems with irregular boundary conditions [115, 116]. 
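
The link between diffusion and correlation operators can be illustrated in one dimension. The following Python sketch is a toy construction, not the scheme of [113]: it assumes a uniform grid, explicit time stepping with coefficient `kappa`, no-flux conditions at both ends and a final rescaling to unit diagonal; all names and parameter values are hypothetical.

```python
import numpy as np

def diffusion_correlation(n, kappa=0.25, n_steps=20):
    """Correlation matrix built from explicit diffusion steps on a 1D grid."""
    lap = -2.0 * np.eye(n) + np.eye(n, k=1) + np.eye(n, k=-1)
    lap[0, 0] = lap[-1, -1] = -1.0              # no-flux conditions at both ends
    step = np.eye(n) + kappa * lap              # one explicit step; needs kappa <= 0.5
    smooth = np.linalg.matrix_power(step, n_steps)  # symmetric smoothing operator
    scale = 1.0 / np.sqrt(np.diag(smooth))
    return scale[:, None] * smooth * scale[None, :]  # rescale to unit diagonal

C = diffusion_correlation(30)
print(np.round(C[10, 8:13], 3))  # correlations decay with distance from point 10
```

Increasing `n_steps` (or `kappa`) lengthens the correlation scale, and the boundary treatment enters only through the discrete Laplacian, which is what makes the approach attractive for irregular domains.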

The use of transforms for spatial covariances requires the specification of corre- 
lation length scales and variances for each of the transformed variables. Since the 
background field is usually a short-term forecast, these statistics must represent the 
structure of errors in the forecasting system being used and thus be diagnosed from 
that. An early method for obtaining these statistical parameters used the difference 
between the observations and the background field (known as the innovations) [57]. 
However, a disadvantage of this method is that it relies on having a sufficient number
of observations and is therefore biased towards data-dense areas. The most popular
method in atmospheric data assimilation is the “NMC method” [97] (see footnote 4). In this
method, the difference between two different forecasts valid at the same time is tak- 
en as a proxy for forecast errors and statistics are taken over a sample of many such 
forecasts. In atmospheric forecasting, usually two forecasts starting 24 hours apart 
are used, with the earlier one run for 48 hours and the later one for 24 hours. By us- 
ing an interval of 24 hours, problems arising from modeling the diurnal variation of 
the atmosphere are avoided. However, this means that the differences are taken over 
a much longer time interval than the normal background forecast, which is usually 6 
or 12 hours. As a result, the covariance structures of the forecast differences do not
necessarily reflect those of the background error and often they need to be modified 
for use in the assimilation system [36, 62]. 

This has motivated the development of ensemble methods to generate statistics 
from shorter forecasts. Such a method for estimating background error statistics from 
an ensemble of short forecasts was developed for use at ECMWF in [36]. The basis of 
this method is that if the inputs to the assimilation system (for example, the back- 
ground, observations and physical boundary conditions) are perturbed within the 
statistics of their errors, then the perturbation in the resulting analysis will be drawn 
from the distribution of analysis error. If a short forecast is produced from this analy- 
sis, then we expect the perturbation to the forecast to be drawn from the distribution 
of forecast error. This perturbed forecast can then be used as a background field for 
the next assimilation time and the process repeated to produce the next analysis and 
another forecast. Suppose that we run two such sequences in parallel for $l$ cycles, starting
from two different sets of perturbations at time $t_0$. Then, at each assimilation time
$t_i$, $i = 1, \ldots, l$, this will produce two perturbed short forecasts $x_i^{b1}$ and $x_i^{b2}$. It can
be shown that the statistics of the true forecast error can then be calculated from the
sample covariance of the differences between these pairs [6],

\[ B \approx \frac{1}{2(l-1)} \sum_{i=1}^{l} \left( x_i^{b1} - x_i^{b2} \right) \left( x_i^{b1} - x_i^{b2} \right)^T , \tag{2.32} \]


under the assumption that the errors in the two forecasts are uncorrelated. The factor 
of 1/2 arises since the sample covariance itself is equal to the sum of the error covari- 
ances of the two different sets of forecasts. Since the forecasts used in this method are 
of the same length as the forecasts used to obtain the background field in the assimi- 
lation, the error statistics produced in this way are a more realistic representation of 
the true error statistics. 
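
The estimate (2.32) is simple to compute once the pairs of forecasts are available. The sketch below uses synthetic data rather than output from a real assimilation system: the function name, the three-variable "true" covariance and the sample size are assumptions, and the two sets of forecast errors are drawn independently, in line with the assumption stated above.

```python
import numpy as np

def pair_difference_covariance(xb1, xb2):
    """Estimate B from l pairs of perturbed forecasts, following (2.32).

    xb1, xb2: arrays of shape (l, n), one forecast per row.
    """
    d = xb1 - xb2
    l = d.shape[0]
    return d.T @ d / (2.0 * (l - 1))

rng = np.random.default_rng(0)
B_true = np.array([[1.0, 0.6, 0.2],
                   [0.6, 1.0, 0.6],
                   [0.2, 0.6, 1.0]])   # hypothetical forecast error covariance
chol = np.linalg.cholesky(B_true)
l = 20000
truth = rng.standard_normal((l, 3))                  # unknown true states
xb1 = truth + rng.standard_normal((l, 3)) @ chol.T   # first perturbed forecast
xb2 = truth + rng.standard_normal((l, 3)) @ chol.T   # second, independent errors
B_est = pair_difference_covariance(xb1, xb2)
print(np.round(B_est, 2))
```

The true state cancels in each difference, which is why the method needs no knowledge of the truth; the factor of one half compensates for the difference carrying the errors of both forecasts.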

A key assumption in the methods presented so far is that the error covariance matrix
represents a statistical average over time. The computational expense of calculating
these statistics means that the matrix is kept constant from day to day, perhaps
with different statistics being used with a change of season. More recently, there has
been interest in developing methods for estimating statistics that vary from day to day
since it is expected that the actual background errors will depend on the underlying
flow. Such flow-dependent statistics arise naturally in ensemble methods of data assimilation,
such as the ensemble Kalman filter. Methods are currently being designed
to obtain some flow-dependent information in variational assimilation by combining
information from ensembles of forecasts with the statistically-averaged error covariance
matrix, for example, [15, 18].

4 So-called because it was first introduced in the National Meteorological Center of the USA, now the
National Centers for Environmental Prediction.


3.3 Observation errors 


As well as representing the errors in the background field, it is important to treat 
the errors in the observations properly within a variational data assimilation sys- 
tem. Observational data received into operational weather and forecasting centers 
can contain errors from a variety of sources, including limitations in the measur- 
ing instrument, biases in the measurements and errors simply due to human error 
in recording the measurement. Furthermore, other errors arise from the way the data 
are used within the data assimilation system, both from inaccuracies in the operators 
used to map the model state to observation space and from the differences in spatial 
resolution between the model and the observations. The theory of variational data 
assimilation assumes that all observational errors are random, unbiased errors with 
a Gaussian distribution and known covariance. It is therefore important that as many 
of these sources of error as possible are accounted for in the data assimilation system.

A first essential step in an operational data assimilation system is to perform 
a quality control check on the data themselves. This may consist of several stages. 
First, a check for obvious errors in the reporting of the data is made, such as
errors in the reported position. For example, if a ship observation is reported over
a land point, it will be rejected from the assimilation. Then, a so-called “background
check” may be made to see how close the observation is to the forecast background 
field. If the difference from the background is too large when compared with its 
expected error variance, then the observation may be rejected and not used in the 
assimilation [2]. Once this check has been performed, the next step is to identify 
observations that may have gross errors, that is, errors that are unlikely to satisfy 
the assumption of being random and normally distributed. This can be done either 
outside or within the assimilation process. Outside the assimilation, each observa- 
tion can be checked against nearby observations and any observations that largely 
disagree with others can be rejected [100]. Alternatively, this check can be included 
in the assimilation process using the variational quality control method [2, 63]. In this 
method, the probability density function of the observation errors is assumed to be
a weighted combination of a standard Gaussian distribution and a flat probability
distribution function, with the weights determined by the probability of gross error of
the observation. Thus, for each single observation $y$ with weight $\alpha_y$, the probability
density function of the observation error is assumed to be of the form

\[ P_{QC} = (1 - \alpha_y) P_N + \alpha_y P_F , \tag{2.33} \]

where $P_N$ indicates the appropriate Gaussian probability density function and $P_F$ is
a flat distribution over a finite interval centered at zero and is equal to zero outside
this interval (the size of this interval is taken to be a multiple of the observation error
standard deviation). The observation part of the cost function is then taken to be
equal to the negative logarithm of $P_{QC}$. In the case where $\alpha_y = 0$, this corresponds
to the observation term in the original nonlinear cost function (2.3). In this method,
observations that have a high probability of gross error are given very little weight in 
the analysis. Initially, these probabilities are assigned to each observation based on 
a study of historical data. The probabilities are then updated on each iteration of the 
minimization procedure by comparison with the current estimate of the state to al- 
low observations to be given more or less weight as the assimilation progresses. The 
introduction of non-Gaussianity means that variational quality control can introduce 
multiple minima into the cost function and so it is necessary to have a good starting 
point for the minimization. For this reason the minimization is first run for several 
iterations without the quality control term before switching it on [1]. 
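
For a single scalar observation, the modified observation term can be written down directly from (2.33). The Python sketch below is illustrative: the function names, the half-width of five standard deviations for the flat distribution and the use of Bayes' rule to update the gross error probability are assumptions made for this example rather than details taken from the text.

```python
import numpy as np

def _densities(d, sigma, half_width):
    """Gaussian and flat densities evaluated at the departure d = y - H(x)."""
    gauss = np.exp(-0.5 * (d / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))
    width = half_width * sigma
    flat = np.where(np.abs(d) <= width, 1.0 / (2.0 * width), 0.0)
    return gauss, flat

def varqc_cost(d, sigma, alpha, half_width=5.0):
    """Observation cost: negative logarithm of the mixture density (2.33)."""
    gauss, flat = _densities(d, sigma, half_width)
    return -np.log((1.0 - alpha) * gauss + alpha * flat)

def gross_error_probability(d, sigma, alpha, half_width=5.0):
    """Probability of gross error given the departure d (Bayes' rule)."""
    gauss, flat = _densities(d, sigma, half_width)
    return alpha * flat / ((1.0 - alpha) * gauss + alpha * flat)

for d in (0.0, 2.0, 4.0):
    print(d, float(varqc_cost(d, 1.0, 0.1)), float(gross_error_probability(d, 1.0, 0.1)))
```

With `alpha = 0` the cost reduces to the usual quadratic term plus a constant. With `alpha > 0` the cost flattens for large departures, so such observations pull the analysis much less strongly, and the loss of convexity is the source of the multiple minima mentioned above.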

A second important aspect of observation errors is the treatment of systematic er- 
rors, or biases, in the observations. This is particularly important for satellite radiance 
data where biases may occur from changes in the measuring instrument over time or 
from errors in the radiative transfer model needed as part of the observation opera- 
tor [54]. Since the assimilation scheme assumes that the observations are unbiased, 
any biases in the observations can introduce biases into the analyses. As with the 
quality control, these biases may be treated offline or within the assimilation scheme. 
For each satellite channel, a bias model is assumed in such a way that we can define 
a new observation operator for the biased measurement 


\[ \tilde{H}(x, \beta) = H(x) + b(\beta, x) , \tag{2.34} \]

with

\[ b(\beta, x) = \sum_{j=0}^{N_p} \beta_j p_j(x) , \tag{2.35} \]

where $p_j$ are predictors for $j = 0, \ldots, N_p$ and $\beta_j$ are scalar coefficients [33]. A few predictors
are chosen that may be related to the state at the observation positions.
The coefficients $\beta_j$ can then be estimated in an offline regression using a few weeks
of data [54] or a variational procedure can be used to estimate these coefficients. This
can be included directly in the assimilation procedure by including (2.34) in the cost
function in place of the standard observation operator and including a background
estimate $\beta^b$ of $\beta$ with covariance $B_\beta$. The 4DVar assimilation problem is then to minimize

\[ \begin{aligned} J_\beta(x_0, \beta) = {} & \tfrac{1}{2} (x_0 - x^b)^T B^{-1} (x_0 - x^b) + \tfrac{1}{2} (\beta - \beta^b)^T B_\beta^{-1} (\beta - \beta^b) \\ & + \tfrac{1}{2} \sum_{i=0}^{N} \left( H_i(x_i) + b(\beta, x_i) - y_i \right)^T R_i^{-1} \left( H_i(x_i) + b(\beta, x_i) - y_i \right) \end{aligned} \tag{2.36} \]

subject to the dynamical equations in order to estimate the state $x_0$ and the coefficients
$\beta$ simultaneously [33]. Alternatively, a variational procedure can be used to
estimate these coefficients offline at regular intervals using the previous value as the 
background for the new estimate [3]. 
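
The offline regression option can be sketched as an ordinary least-squares fit of the bias model (2.35) to a sample of observation-minus-model departures. Everything in the example below is synthetic and hypothetical: the predictors (a constant and two random fields), the "true" coefficients and the noise level are assumptions chosen to show the mechanics only.

```python
import numpy as np

def fit_bias_coefficients(predictors, departures):
    """Least-squares estimate of the coefficients beta_j in (2.35).

    predictors: (n_obs, Np + 1) array; column j holds p_j(x) at each observation.
    departures: (n_obs,) array of y - H(x) for the same observations.
    """
    beta, *_ = np.linalg.lstsq(predictors, departures, rcond=None)
    return beta

rng = np.random.default_rng(0)
n_obs = 5000
predictors = np.column_stack([
    np.ones(n_obs),               # p_0: constant offset
    rng.standard_normal(n_obs),   # p_1: stand-in for a state-dependent predictor
    rng.standard_normal(n_obs),   # p_2: a second such predictor
])
beta_true = np.array([0.5, -0.2, 0.1])
departures = predictors @ beta_true + 0.5 * rng.standard_normal(n_obs)

beta_hat = fit_bias_coefficients(predictors, departures)
print(np.round(beta_hat, 3))
```

The variational alternative differs in that $\beta$ is estimated together with the state inside the minimization of (2.36), with the background term in $\beta$ playing the role of a regularization.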

Finally, we consider the specification of the observation error covariance matrix, 
which represents the covariance of the random components of the observation error. 
It is important to note that this error is defined by the difference between the actual 
measurement and the model representation of the true state $x^t$ mapped into observation
space by the observation operator, that is, the error $\varepsilon_i^o$ at time $t_i$ is given by

\[ \varepsilon_i^o = y_i - H_i(x_i^t) . \tag{2.37} \]

This means that the error includes different components arising from the accuracy
of the measuring instrument (instrument error), errors in the observation operator
$H_i$ and errors due to the difference in spatial resolution between the measurement
and the model state (known as the representativity error). The instrument error is 
the easiest to treat since the variances of this error can usually be obtained from the 
instrument manufacturer and it is normally safe to assume that these errors are un- 
correlated. However, this may not always be the case. For example, measurements 
derived by preprocessing satellite data may include spatial correlations [17]. Errors 
in the observation operator may include such things as errors in the radiative trans- 
fer models used to model satellite data which can lead to error correlations between 
different satellite channels [16, 105]. 

Although it is recognized that observation error correlations exist, particularly 
with respect to satellite data, the correlations are not usually very well treated in cur- 
rent operational forecasting systems. Often, the correlations are ignored and it is as- 
sumed that the observation error covariance matrix is diagonal. To balance this as- 
sumption, either the error variances are inflated [56] or the data are thinned so that 
fewer of them are used [32]. The reasons for this are the difficulty in calculating what 
the error correlations should be and the difficulty in then representing these correla- 
tions within an assimilation scheme in a way that the inverse correlation matrix can 
easily be applied. To estimate the correlations in satellite data, the methods that have
mainly been used are a comparison with independent measurements from radiosondes
based on the method of [57], and the use of diagnostics calculated from the data
assimilation system itself based on [35]. Various ways of representing these correlations
within the data assimilation system have been proposed, including the use of
a circulant matrix [55], an eigenvalue decomposition [37] and a Markov matrix [105]. 
However, thus far, little use of these methods exists in operational practice. 


3.4 Optimization methods 


The minimization of the inner loop cost function (2.15) requires the use of a suitable 
optimization algorithm. For the large problems of environmental modeling, there are 
two particularly important constraints. The first is that because of the number of vari- 
ables in the system, it is not possible to obtain second derivative information. The 
Hessian or second derivative matrix would contain of the order $10^{16}$ elements, which
is impossible to calculate or to store. Hence, only methods that require first derivative 
information can be used. The second constraint is that often these problems must be 
solved within a real-time forecasting system and hence the computer time that can 
be used to solve the problem is very limited. Hence, the methods must use as few function
evaluations as possible. This means that usually the problem is not allowed to run to 
full convergence and the use of any line search algorithms is prohibitively expensive. 
Traditionally, the algorithms that are most commonly used within data assimilation 
systems are quasi-Newton algorithms and conjugate gradient or related Lanczos algo- 
rithms, which only require first derivative information to be provided. The mathemat- 
ical details of these algorithms are well explained elsewhere (e.g. [94]), and so here 
we limit the discussion to their implementation in data assimilation systems. 

An essential aspect of the minimization procedure for variational data assimi- 
lation is an appropriate preconditioning. Experimental evidence indicates that the 
Hessian of the inner loop cost function (2.15) is badly conditioned and that this aris- 
es from the ill-conditioning of the background error covariance matrix [83]. This has 
been further confirmed by theoretical results that bound the condition number of the 
Hessian of the cost function in terms of the condition number of this covariance ma- 
trix [50, 51]. The first level of preconditioning that is applied is therefore to transform 
the problem to new variables, as described in Section 3.2. The transformed problem 
(2.24) can be shown in general to be better conditioned both in theory and in practice 
[42, 50, 51, 83]. However, even after this transformation, the problem is not very well- 
conditioned and can have a condition number of order 10-10? [39, 52]. Experiments 
in the ECMWF system showed that the ill-conditioning that remains is related to the 
inclusion of dense, accurate surface observations over Europe [110] and this has also 
been shown to be true for the system of the Met Office [52]. This can be explained by 
theoretical bounds obtained by [50, 52], which show that the condition number of the 
transformed problem increases as the spacing between observations decreases and as
observations become more accurate. Hence, ideally, a second level of preconditioning
is required after the variable transformation has been performed. 

In order to implement a further preconditioning, it is necessary to find a precondi- 
tioning matrix K that is inexpensive to compute and such that the eigenvalues of KG 
are more clustered than those of the Hessian G of the transformed problem. Often, 
the preconditioning matrix may be represented in the factored form $K = P P^T$ and the
factor $P$ is then used directly, for example in the preconditioned conjugate
gradient method [111]. In order to design such a preconditioner, some knowledge
of the Hessian (2.29) of the transformed cost function is required. One way that
this can be obtained is by using a Lanczos algorithm to perform the inner loop minimization.
The Lanczos method produces estimates of the leading eigenvectors and
eigenvalues of the Hessian of the function being minimized. If the first $m$ eigenvalues
$\lambda_j$ and eigenvectors $u_j$, $j = 1, \ldots, m$, have sufficiently converged, then the inverse of
the Hessian (2.29) can be approximated by the expression

\[ K = I + \sum_{j=1}^{m} \left( \lambda_j^{-1} - 1 \right) u_j u_j^T . \tag{2.38} \]
This expression can then be used for the preconditioning of subsequent minimiza- 
tions under the assumption that the Hessian does not change greatly between one 
minimization and another [39, 111]. This method, known as spectral preconditioning, 
is used in the operational forecast system of ECMWF, where three outer loops are per- 
formed for each assimilation. During the first inner loop minimization, the Lanczos 
vectors are stored and these are then used to precondition the minimization of the 
second and third inner loop cost functions [39]. It has been shown that this precon- 
ditioner belongs to a larger class of limited memory preconditioners [111]. In order to 
define this class, we let $s_i \in \mathbb{R}^n$, $i = 1, \ldots, l$, with $l < n$, be a set of $G$-conjugate
vectors. Then, the limited-memory preconditioning matrix is given by

\[ K_l = \left( I_n - \sum_{i=1}^{l} \frac{s_i s_i^T}{s_i^T G s_i} G \right) \left( I_n - \sum_{i=1}^{l} G \frac{s_i s_i^T}{s_i^T G s_i} \right) + \sum_{i=1}^{l} \frac{s_i s_i^T}{s_i^T G s_i} . \tag{2.39} \]

If the vectors $s_i$ are chosen to be the eigenvectors of $G$, then this formula results in the
spectral preconditioning matrix (2.38).
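
Both formulas can be checked on a small dense example. In the Python sketch below the Hessian is a hypothetical identity-plus-low-rank matrix and the exact leading eigenpairs from `numpy.linalg.eigh` stand in for converged Lanczos estimates; the function names are assumptions made for this illustration. Forming the matrices explicitly is only possible because the example is tiny; in a large system the preconditioner would be applied as an operator.

```python
import numpy as np

def spectral_preconditioner(eigvals, eigvecs):
    """K of (2.38) from m eigenpairs; columns of eigvecs are orthonormal."""
    n = eigvecs.shape[0]
    return np.eye(n) + (eigvecs * (1.0 / eigvals - 1.0)) @ eigvecs.T

def limited_memory_preconditioner(G, S):
    """K_l of (2.39) for the G-conjugate columns of S."""
    n = G.shape[0]
    GS = G @ S
    w = 1.0 / np.einsum("ij,ij->j", S, GS)   # 1 / (s_i^T G s_i)
    A = np.eye(n) - (S * w) @ GS.T           # I - sum_i s_i s_i^T G / (s_i^T G s_i)
    return A @ A.T + (S * w) @ S.T

rng = np.random.default_rng(0)
n, m = 20, 3
W = rng.standard_normal((n, 5))
G = np.eye(n) + W @ W.T                      # identity plus a rank-five matrix
lam, vec = np.linalg.eigh(G)                 # eigenvalues in ascending order
K = spectral_preconditioner(lam[-m:], vec[:, -m:])

print(np.linalg.cond(G), np.linalg.cond(K @ G))
print(np.allclose(K, limited_memory_preconditioner(G, vec[:, -m:])))
```

The preconditioner maps the $m$ largest eigenvalues of $G$ to one, so the condition number of the preconditioned matrix is governed by the largest eigenvalue that was not captured.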

The authors of [111] propose an alternative preconditioner from the same class 
based on the Ritz pairs of the Hessian. Ritz pairs are approximate eigenpairs $(\theta_i, v_i)$
defined in an appropriately chosen subspace. By choosing the subspace to be that
spanned by the Lanczos vectors, the authors obtain the Ritz limited memory preconditioner

\[ K_l^{\mathrm{Ritz}} = \left( I_n - \sum_{i=1}^{l} \frac{v_i v_i^T}{\theta_i} G \right) \left( I_n - \sum_{i=1}^{l} G \frac{v_i v_i^T}{\theta_i} \right) + \sum_{i=1}^{l} \frac{v_i v_i^T}{\theta_i} . \tag{2.40} \]

They found that the use of this preconditioner can provide an improvement over spectral
preconditioning when the estimates of the Hessian eigenpairs are inaccurate.
A similar result was also found in the Regional Ocean Modeling System (ROMS) in
which both of these preconditioners are implemented [91]. One drawback of both of
these methods is that the first minimization must be performed in order to generate
the vectors $s_i$ before any preconditioning can be applied. Thus far, little attention
has been paid to preconditioning of this first minimization.

With any minimization method, it is important to specify appropriate stopping 
criteria and this is also the case in variational data assimilation. As discussed in Sec- 
tion 2.1, it has been proved that the inner loop step of the Gauss-Newton method (step 
1.1 of Algorithm 2.1) needs to be solved to sufficient accuracy in order to ensure con- 
vergence of the outer loops [45]. The theory has been used to show how it is natural to 
use an inner loop stopping criterion based on the relative change in the norm of the
gradient, of the form

$$
\frac{\|\nabla \tilde{J}^{(k)}_i\|_2}{\|\nabla \tilde{J}^{(k)}_0\|_2} < \varepsilon, \tag{2.41}
$$

where the subscript indicates the inner loop iteration index and $\varepsilon$ is a specified tolerance
[73]. The tolerance used to stop the iterations must therefore be chosen carefully.
If it is too high, then there is no guarantee that the outer loop steps will converge.
However, the convergence should not be pushed below the level of noise on the ob- 
servations, as then small spatial scales are adjusted to fit the observational noise [68]. 
In many practical forecasting problems, such care is not always taken and other cri- 
teria are introduced. There are two main reasons for this. One is that in a time-critical 
forecasting system, it may be considered more important to solve each minimization 
problem using approximately the same amount of wall-clock time rather than to the 
same accuracy. The second reason is that the preconditioning techniques described 
in this section require a minimum number of iterations to be performed on the first in- 
ner loop minimization in order to acquire sufficiently accurate information about the 
Hessian. Hence, criteria that have been introduced include stopping the iterations 
when the value of the cost function is close to its expected minimum value [84] or 
using a fixed number of iterations, particularly for the first minimization [110]. 
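
The two kinds of stopping rule discussed above can be compared on a small quadratic problem. The following is a minimal sketch, assuming a conjugate gradient inner loop applied to a randomly generated, well-conditioned quadratic cost function; the function `cg_inner_loop` and the test problem are hypothetical and do not represent any operational system. The relative gradient norm test mirrors the form of (2.41), while setting the tolerance to zero with a small `max_iter` mimics a fixed number of iterations.

```python
import numpy as np


def cg_inner_loop(A, b, eps, max_iter):
    """Minimize 0.5 x^T A x - b^T x by conjugate gradients.

    Stops when ||grad_i|| / ||grad_0|| < eps or after max_iter iterations.
    """
    x = np.zeros_like(b)
    g = A @ x - b                         # gradient at the starting point
    g0_norm = np.linalg.norm(g)
    d = -g
    it = 0
    while it < max_iter and np.linalg.norm(g) / g0_norm >= eps:
        Ad = A @ d
        alpha = (g @ g) / (d @ Ad)        # exact line search for a quadratic
        x = x + alpha * d
        g_new = g + alpha * Ad
        beta = (g_new @ g_new) / (g @ g)
        d = -g_new + beta * d
        g = g_new
        it += 1
    return x, it, np.linalg.norm(g) / g0_norm


rng = np.random.default_rng(1)
n = 50
W = rng.standard_normal((n, n))
A = np.eye(n) + W @ W.T / n               # hypothetical well-conditioned SPD Hessian
b = rng.standard_normal(n)

x_loose, it_loose, ratio_loose = cg_inner_loop(A, b, eps=1e-2, max_iter=200)
x_tight, it_tight, ratio_tight = cg_inner_loop(A, b, eps=1e-6, max_iter=200)
x_fixed, it_fixed, ratio_fixed = cg_inner_loop(A, b, eps=0.0, max_iter=5)

print("loose tolerance:", it_loose, "iterations, ratio", ratio_loose)
print("tight tolerance:", it_tight, "iterations, ratio", ratio_tight)
print("fixed count:    ", it_fixed, "iterations, ratio", ratio_fixed)
```

With a fixed iteration count the cost is predictable but the achieved gradient reduction is not, which is the trade-off described in the text.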


3.5 Reduced order approaches 


As was mentioned in Section 2.1, a major advantage of the incremental approach is 
that the inner loop problem may be solved in a smaller dimensional space than the 
outer loop update of the linearization trajectory. Within environmental prediction, 
lower spatial resolution systems have often been used in the inner loop step, with the 
full resolution nonlinear model being used in the outer loop. Further simplifications
may also be made to the linear dynamical model used in the inner loop, such as using
simplified parametrizations of subgrid-scale processes as described in Section 3.1.
While a change in resolution is certainly the simplest way to achieve a more computationally
tractable inner loop problem, it does not necessarily provide the most accurate
low order representation of the linearized cost function and its constraint. In
order to improve on this, other reduced order approaches have been investigated in
the context of incremental 4DVar. These essentially fall into two categories: methods
based on principal component analysis and methods based on near-optimal reduction
of dynamical systems.

Principal component analysis, which is often referred to as proper orthogonal
decomposition (POD) or the method of empirical orthogonal functions (EOFs), aims
to represent the solution of the assimilation problem as a linear combination of basis
vectors. The basis vectors are chosen to represent the leading directions of variability
in the model and are calculated using a series of model states, or “snapshots”, from
an integration of the nonlinear model. Such a method was used in an ocean model
assimilation by [101]. From the sample of model states, the authors generate the matrix
$X = (X_1, \ldots, X_l)$, where $X_i$ is the difference between the model state at time $t_i$ and
the mean state. The covariance matrix $XX^T/(l-1)$ is then diagonalized to find a set
of orthonormal eigenvectors $v_i$ (EOFs) and associated eigenvalues $\lambda_i$, $i = 1, \ldots, l$.⁵
The solution $\delta x_0$ to the inner loop minimization problem (2.15) is then defined by
an expansion of the leading r eigenvectors

$$
\delta x_0 = \sum_{i=1}^{r} w_i v_i = V w, \tag{2.42}
$$

where $V = (v_1, \ldots, v_r)$ is the matrix of the leading r eigenvectors and the vector
$w = (w_1, \ldots, w_r)^T$ contains the weights to be determined. In this case, the matrix
V acts as a variable transformation in a similar way to the parameter transform (2.23)
and so the background term can be written in the form

$$
J_b(w) = \frac{1}{2} w^T B_w^{-1} w, \tag{2.43}
$$

where the covariance matrix $B_w$ is taken to be the diagonal matrix of eigenvalues. The
number of vectors r that are used in the expansion is chosen in order to ensure that 
a large fraction of the total variance is retained, where this fraction is calculated from 
the eigenvalues as 


$$
\frac{\sum_{i=1}^{r} \lambda_i}{\sum_{i=1}^{l} \lambda_i}. \tag{2.44}
$$

This method has been applied to assimilation in ocean models in an idealized setting
[101] and using real data [60]. It is noted that the assumption behind this method is
that the variability of the system can be well described by a low-dimensional space.
Although the approach reduces the size of the space in which the minimization is
performed, the tangent linear model (2.16) must still be integrated at full resolution
on each iteration.

5 In practice, the eigenvalues can be found by diagonalizing the much smaller matrix $X^T X/(l-1)$ [11].
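
The EOF construction and the choice of r from the retained variance fraction can be sketched in a few lines. The following is an illustrative example only, assuming synthetic snapshots built from two dominant patterns plus small noise; the function `leading_eofs` and the data are hypothetical and are not the implementation of [101]. It diagonalizes the small $l \times l$ matrix $X^T X/(l-1)$ and lifts its eigenvectors to EOFs of $XX^T/(l-1)$.

```python
import numpy as np


def leading_eofs(snapshots, fraction):
    """Return the leading EOFs of an n x l snapshot array (one model state per column)."""
    n, l = snapshots.shape
    X = snapshots - snapshots.mean(axis=1, keepdims=True)   # perturbations from the mean state
    small = X.T @ X / (l - 1)                                # l x l rather than n x n
    lam, U = np.linalg.eigh(small)
    order = np.argsort(lam)[::-1]                            # sort eigenvalues in descending order
    lam, U = np.clip(lam[order], 0.0, None), U[:, order]
    retained = np.cumsum(lam) / np.sum(lam)                  # the fraction (2.44) for r = 1, 2, ...
    r = min(int(np.searchsorted(retained, fraction)) + 1, l)
    V = X @ U[:, :r]                                         # lift to eigenvectors of X X^T / (l - 1)
    V /= np.linalg.norm(V, axis=0)                           # normalize each EOF
    return V, lam[:r], retained[r - 1]


rng = np.random.default_rng(2)
n, l = 200, 20
modes, _ = np.linalg.qr(rng.standard_normal((n, 2)))        # two hypothetical dominant patterns
amplitudes = np.array([[10.0], [3.0]]) * rng.standard_normal((2, l))
snapshots = 5.0 + modes @ amplitudes + 0.01 * rng.standard_normal((n, l))

V, lam, kept = leading_eofs(snapshots, fraction=0.95)
print("number of EOFs kept:", V.shape[1])
print("fraction of variance retained:", kept)
```

The inner loop increment is then sought in the form $Vw$, so the control vector has length r rather than n.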

An alternative approach, based on POD, was put forward by [22, 23]. In that work, 
the solution to the full nonlinear 4DVar problem is expressed as a perturbation from 
the sample mean that is expanded in terms of basis functions $\Phi_i$ such that

$$
\delta x_0 = \sum_{i=1}^{m} w_i \Phi_i, \tag{2.45}
$$

where $w_i$ are again weights to be determined. The basis functions are derived in a similar
way to the EOFs, but by then projecting the perturbation fields X onto the eigenvectors
$v_i$, and thus

$$
\Phi = \{\Phi_1, \ldots, \Phi_m\} = XV. \tag{2.46}
$$


The number of basis functions that are used in the expansion is again determined us- 
ing the fractional variance (2.44). In this work, the authors solve the nonlinear 4DVar 
cost function (2.3) in the reduced space. As well as expressing the background term in 
terms of coefficients of the basis functions, they also derive a Galerkin projection of 
the dynamical model onto the basis functions for use in the observation term. Thus, 
this formulation has the advantage that the dynamical model and its adjoint are also 
expressed in the reduced space. Again, this method relies on the snapshots being able 
to capture a low-dimensional subspace that adequately describes the full system. 

A disadvantage with both the EOF and POD methods is that they do not use 
any information about the data assimilation problem itself within the reduction pro- 
cedure. There have been two approaches proposed to improve on this. The first is 
an adaptation of the POD method, called dual-weighted POD. In this method the snapshot
perturbations X are weighted according to the sensitivity of the cost function at
the time of the snapshot, where the weights are calculated using the adjoint mod- 
el [30]. The other approach, put forward in the series of papers [14, 75, 76], is to use 
near-optimal model order reduction methods for linear dynamical systems to derive 
a reduced order model and observation operator. The inner loop problem of incre- 
mental 4DVar (2.15) is subject to the dynamical system described by the evolution 
equation (2.16) and the output equation 


$$
d_i = H_i \delta x_i. \tag{2.47}
$$


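Model reduction seeks linear restriction operators $S_i^T$ and prolongation operators $T_i$
that map the perturbation $\delta x_i \in \mathbb{R}^n$ to $\delta \hat{x}_i \in \mathbb{R}^r$ with $r \ll n$. These operators are
chosen such that the output of the projected system

$$
\delta \hat{x}_{i+1} = S_i^T M_i T_i \delta \hat{x}_i, \tag{2.48}
$$

$$
\hat{d}_i = H_i T_i \delta \hat{x}_i \tag{2.49}
$$

approximates well the output of the full dynamical system $d_i$. The inner loop problem
can then be defined in the reduced space as the minimization of

$$
\begin{aligned}
\min \tilde{J}^{(k)}[\delta \hat{x}_0^{(k)}] = {} & \frac{1}{2} \left( \delta \hat{x}_0^{(k)} - S_0^T [x^b - x_0^{(k)}] \right)^T \left( S_0^T B S_0 \right)^{-1} \left( \delta \hat{x}_0^{(k)} - S_0^T [x^b - x_0^{(k)}] \right) \\
& + \frac{1}{2} \sum_{i=0}^{N} \left( H_i T_i \delta \hat{x}_i^{(k)} - d_i^{(k)} \right)^T R_i^{-1} \left( H_i T_i \delta \hat{x}_i^{(k)} - d_i^{(k)} \right),
\end{aligned}
$$

subject to the reduced dynamical model (2.48). The linearization state is then updated
with the perturbation

$$
\delta x_0^{(k)} = T_0 \delta \hat{x}_0^{(k)}. \tag{2.50}
$$
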
The authors of these papers use the method of balanced truncation [92] to demon- 
strate this method in the case where the operators M and H are time-invariant. The 
aim of balanced truncation is to truncate the states of the system that are least affect- 
ed by the inputs and have least effect on the outputs. Since these are not generally 
the same, the first step in the method is to transform the system into one in which 
these states coincide, the “balancing” step. It is first necessary to find the state covariance
matrices P and Q associated with the inputs and outputs respectively. These
are found by solving the Stein equations

$$
P = M P M^T + B \tag{2.51}
$$

and

$$
Q = M^T Q M + H^T R^{-1} H. \tag{2.52}
$$

The balancing transformation $\Psi$ is then given by the matrix of eigenvectors of PQ,
while the square roots of the eigenvalues of PQ are equal to the Hankel singular values
of the full system. The reduction step then calculates the restriction and prolongation
operators from

$$
S^T = [I_r, 0] \, \Psi^{-1}, \tag{2.53}
$$

$$
T = \Psi \begin{bmatrix} I_r \\ 0 \end{bmatrix}, \tag{2.54}
$$



where the decay of the Hankel singular values is used to choose the model reduction 
order r. In idealized models the studies [14, 75, 76] show how this method improves 
the solution with respect to using low resolution models and how it is important to 
use information about the assimilation problem in the reduction procedure, including 
information about the background and observation error covariance matrices. How- 
ever, whereas reduction methods based on POD can be implemented in large systems, 
the method of balanced truncation cannot. Although efficient numerical methods 
are available to apply balanced truncation to systems of moderately large size (e.g. 
[25, 53, 69]), these are not suitable for the very large systems found in environmental 
prediction. Efforts are being made to design near-optimal reduction methods for such 
systems based on Krylov methods [21], but these methods have not yet been tried out 
in data assimilation for large systems. 
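
For a very small time-invariant system the balancing and truncation steps can be written out directly. The following is a minimal sketch under several assumptions: the model M, observation operator H and covariances are random or identity stand-ins, the Stein equations are solved by brute-force vectorization (feasible only for tiny n), and $\Psi$ is scaled through a Cholesky factor of P, which is one convenient normalization of the eigenvectors of PQ rather than the method of [14, 75, 76]. The Hankel singular values are taken as the square roots of the eigenvalues of PQ.

```python
import numpy as np


def solve_stein(M, C):
    """Solve P = M P M^T + C by vectorization; only sensible for small systems."""
    n = M.shape[0]
    lhs = np.eye(n * n) - np.kron(M, M)      # row-major vec(M P M^T) = (M kron M) vec(P)
    P = np.linalg.solve(lhs, C.reshape(-1)).reshape(n, n)
    return 0.5 * (P + P.T)                   # remove rounding asymmetry


def balanced_truncation(M, H, B, R, r):
    P = solve_stein(M, B)                                   # (2.51)
    Q = solve_stein(M.T, H.T @ np.linalg.solve(R, H))       # (2.52)
    L = np.linalg.cholesky(P)                               # P = L L^T
    sig2, U = np.linalg.eigh(L.T @ Q @ L)                   # same eigenvalues as P Q
    order = np.argsort(sig2)[::-1]
    sig2, U = np.clip(sig2[order], 0.0, None), U[:, order]
    hankel = np.sqrt(sig2)                                  # Hankel singular values
    scale = np.sqrt(hankel[:r])
    T = (L @ U[:, :r]) / scale                              # first r columns of Psi
    St = np.linalg.solve(L.T, U[:, :r]).T * scale[:, None]  # first r rows of Psi^{-1}
    return St, T, hankel, P, Q


rng = np.random.default_rng(3)
n, p, r = 6, 2, 2
A = rng.standard_normal((n, n))
M = 0.9 * A / np.max(np.abs(np.linalg.eigvals(A)))   # hypothetical stable model, spectral radius 0.9
H = rng.standard_normal((p, n))                       # hypothetical observation operator
B, R = np.eye(n), np.eye(p)

St, T, hankel, P, Q = balanced_truncation(M, H, B, R, r)
M_red, H_red = St @ M @ T, H @ T                      # reduced model and observation operator

print("Hankel singular values:", hankel)
print("S^T T is the identity:", np.allclose(St @ T, np.eye(r)))
```

The Kronecker-product solve needs an $n^2 \times n^2$ matrix, which illustrates why this direct approach does not scale to the system sizes found in environmental prediction.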


3.6 Issues for nested models


For very high resolution weather and ocean forecasting, operational centers often use
models covering only the domain of interest that are nested in a larger model, often of 
lower resolution, which we refer to here as the parent model. In most of the systems 
the nesting is a one-way nesting, whereby lateral boundary conditions for the nest- 
ed model are provided by the parent model, but there is no feedback from the high 
resolution nested model to the parent model. This presents particular challenges for 
the application of variational data assimilation. For problems specific to high resolu- 
tion weather forecasting we refer the reader to the review articles [96] and [31]. Here 
we consider only more general problems arising from using a high resolution nested 
grid, in particular treatment of the lateral boundary conditions and of the difference 
in representation of spatial scales between the parent and nested models. 

With respect to the lateral boundary conditions, a decision must be made as to 
whether to estimate them as part of the assimilation procedure or to assume that they 
do not change. Both approaches have been used in practice. In the operational weather
forecasting system of the Met Office the lateral boundary conditions are not updated,
but are fixed by the parent model. Hence the increment $\delta x$ on the boundary is
set to zero. This has advantages for the practical implementation of the scheme. In 
particular it allows a simple sine transform to be used in the definition of the spa- 
tial background error covariances described in Section 3.2, which then enforces zero 
boundary increments [83]. However, observational information close to the bound- 
aries can be difficult to use, since the nested model cannot use observations lying 
outside the domain and the analysis inside the domain may not be consistent with 
the boundary conditions provided [4, 47]. This can lead to features being artificially 
cut-off close to the boundaries. 

The alternative approach is to estimate the boundary variables within the assim- 
ilation procedure [48, 49, 67]. This means that the state vector x is defined to include 
both the variables in the interior of the domain and on the lateral boundaries. In this 
way observations inside the nested domain can update the boundary values and so it 
is possible to ensure that the analysis is consistent throughout the domain. However 
in this case it is no longer possible to apply a sine transform to impose the spatial 
background error covariances. In order to be able to apply a spectral transformation 
an extension zone is created around the domain to obtain fields that are horizontal- 
ly periodic. A Fourier transform can then be applied. One difficulty in analyzing the 
boundaries in this way is that the lateral boundary conditions are only updated dur- 
ing the assimilation period. During the subsequent forecast no updates are available 
and the values from the parent model must be used, so there is some inconsistency 
between the boundary conditions of the analysis and those of the forecast. Howev- 
er, some consistency over the assimilation window can be ensured by estimating the 
boundary conditions at the beginning and end of the assimilation window, with both
constrained by background values from the parent model. In this case the cost function
to be minimized is of the form

$$
\begin{aligned}
J(x_0, x_{lbc}) = {} & \frac{1}{2} (x_0 - x^b)^T B^{-1} (x_0 - x^b) + \frac{1}{2} (x_{lbc} - x^b_{lbc})^T B_{lbc}^{-1} (x_{lbc} - x^b_{lbc}) \\
& + \frac{1}{2} \sum_{i=0}^{N} (H_i(x_i) - y_i)^T R_i^{-1} (H_i(x_i) - y_i),
\end{aligned} \tag{2.55}
$$

where $x_0$ represents the model variables in the interior of the domain and the lateral
boundary conditions at initial time $t_0$, $x_{lbc}$ is the lateral boundary condition at final
time $t_N$, $x^b$ is the background estimate of $x_0$, with error covariance matrix B, and $x^b_{lbc}$
is the background estimate of $x_{lbc}$, with error covariance matrix $B_{lbc}$ [67].

The second challenge we consider is the difference in the spatial scales that can 
be represented in the nested and parent models. In particular, since the nested model 
often covers only a small domain, the assimilation scheme is not able to adequately 
analyze scales of the size of the domain and larger. In applications such as weather 
prediction, it is important to capture these larger scales since the physical system is 
inherently multiscale with strong feedbacks between large and small scales. Hence, 
attempts have been made to improve the large scale information in nested model data 
assimilation by providing information on these scales from a parent model analysis. 
For example, the Met Office experimented with a system that combined large scale 
increments from a parent model analysis with the small scale increments from the 
nested model analysis [4]. In this method, the large scales of the nested model analy- 
sis are forced to be equal to those of the parent model. 

An alternative, proposed by [47], is to use the large scales of the parent analysis 
over the nested model domain as a weak-constraint on the variational problem. We let 
$x_p^a$ be the analysis from the parent model and define operators $H_p$ and $H_n$ such that
$H_p(x_p^a)$ represents some large scales of the parent analysis on the nested domain
and $H_n(x)$ represents the same large scales from the nested model field x. Then, the
difference between the large scales of the global analysis and those forecasted by the
nested model can be constrained by adding an extra term to the cost function (2.3) of
the form

$$
\frac{1}{2} \left( H_p(x_p^a) - H_n(x) \right)^T B_p^{-1} \left( H_p(x_p^a) - H_n(x) \right), \tag{2.56}
$$

where $B_p$ is the error covariance matrix of the parent model large scales. This means
that the analysis is constrained by large scales from the parent model through this 
additional term, and by large scales from the nested model through the background 
term. In theory, this should introduce another term including the cross-correlation 
between these two sources of information. However, in their demonstration of the 
method in a 3DVar scheme of the ALADIN model at Météo-France, the authors of [47] 
concluded that this cross-correlation could be neglected, though at the cost of some 
inaccuracy.

A more theoretical study of this problem was carried out by [9]. They used a spectral
analysis to show how information from waves longer than the domain size is projected
onto different scales in the nested model domain corresponding to the lowest
wave numbers that can be represented on this domain. They demonstrated that by
giving more weight to these scales in the background term of the cost function, it
was possible to retain more of the large scale information from a parent model background.
In this method, the large spatial scales from only the parent model are used
as a constraint in the assimilation, as in [4], but they are not imposed exactly and may
be altered by the assimilation process. The authors of [9] demonstrated benefit from
this in an idealized system, but the method has not been tested in a realistic model.


3.7 Weak-constraint variational assimilation 


The formulation of variational data assimilation presented in Section 2 assumes that 
the discrete dynamical model (2.1) is an exact representation of the physical system 
being observed. In practice, we know that the models contain errors caused by lim- 
itations in our knowledge of the physical equations and limitations in the numeri- 
cal modeling, for example, the need for subgrid scale parametrizations. In theory, it 
is possible to account for and estimate such errors in variational data assimilation, 
though implementation in practice is more complicated. We assume an additive error 
to the model equations, and thus the true dynamical system can be written as 


$$
x_{i+1} = M_i(x_i) + \eta_i, \tag{2.57}
$$

where $\eta_i$ are the unknown model errors at times $t_i$ which are assumed to be random,
serially uncorrelated, Gaussian errors with covariance matrix $Q_i$. Then, we can define
a weak-constraint 4DVar problem in which the model equations do not have to be
exactly satisfied over the assimilation window. We define a cost function of the form

$$
\begin{aligned}
J(x_0, \eta_0, \ldots, \eta_{N-1}) = {} & \frac{1}{2} (x_0 - x^b)^T B^{-1} (x_0 - x^b) \\
& + \frac{1}{2} \sum_{i=0}^{N} (H_i(x_i) - y_i)^T R_i^{-1} (H_i(x_i) - y_i) + \frac{1}{2} \sum_{i=0}^{N-1} \eta_i^T Q_i^{-1} \eta_i,
\end{aligned} \tag{2.58}
$$

subject to (2.57). The weak-constraint problem is then to minimize (2.58) with respect
to the initial state $x_0$ and all the model errors $\eta_i$.

An alternative formulation of the weak-constraint problem (2.58) is to write it in
terms of the model state $x_i$ at each time $t_i$ rather than in terms of the model errors.
This leads to the cost function

$$
\begin{aligned}
J(x_0, x_1, \ldots, x_N) = {} & \frac{1}{2} (x_0 - x^b)^T B^{-1} (x_0 - x^b) \\
& + \frac{1}{2} \sum_{i=0}^{N} (H_i(x_i) - y_i)^T R_i^{-1} (H_i(x_i) - y_i) \\
& + \frac{1}{2} \sum_{i=0}^{N-1} (x_{i+1} - M_i(x_i))^T Q_i^{-1} (x_{i+1} - M_i(x_i)).
\end{aligned} \tag{2.59}
$$



In [109], both formulations were presented in the incremental version of 4DVar as 
possibilities for inclusion in the ECMWF system. 
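
The equivalence of the two formulations can be illustrated on a toy problem. The following is a minimal sketch, assuming a hypothetical three-variable nonlinear model, an observation operator that observes only the first component, and diagonal covariance matrices; none of these choices come from [109]. Given $x_0$ and the model errors, the states are generated through (2.57), and the state form of the cost function evaluated on those states returns the same value as the error form.

```python
import numpy as np


def model(x):
    """Hypothetical nonlinear model M_i, taken to be the same at every step."""
    return x + 0.1 * np.sin(x)


def observe(x):
    """Hypothetical observation operator H_i: observe the first component only."""
    return x[:1]


def background_and_observation_terms(xs, xb, Binv, ys, Rinv):
    jb = 0.5 * (xs[0] - xb) @ Binv @ (xs[0] - xb)
    jo = 0.5 * sum((observe(x) - y) @ Rinv @ (observe(x) - y) for x, y in zip(xs, ys))
    return jb + jo


def cost_error_form(x0, etas, xb, Binv, ys, Rinv, Qinv):
    """(2.58): control variables are x_0 and the model errors eta_0, ..., eta_{N-1}."""
    xs = [x0]
    for eta in etas:
        xs.append(model(xs[-1]) + eta)                 # (2.57)
    jq = 0.5 * sum(eta @ Qinv @ eta for eta in etas)
    return background_and_observation_terms(xs, xb, Binv, ys, Rinv) + jq, xs


def cost_state_form(xs, xb, Binv, ys, Rinv, Qinv):
    """(2.59): control variables are the states x_0, ..., x_N."""
    jq = 0.5 * sum((x_next - model(x)) @ Qinv @ (x_next - model(x))
                   for x, x_next in zip(xs[:-1], xs[1:]))
    return background_and_observation_terms(xs, xb, Binv, ys, Rinv) + jq


rng = np.random.default_rng(4)
n, N = 3, 5
xb = rng.standard_normal(n)
x0 = xb + 0.1 * rng.standard_normal(n)
etas = [0.05 * rng.standard_normal(n) for _ in range(N)]
ys = [rng.standard_normal(1) for _ in range(N + 1)]
Binv, Rinv, Qinv = np.eye(n), 4.0 * np.eye(1), 100.0 * np.eye(n)

J_error, xs = cost_error_form(x0, etas, xb, Binv, ys, Rinv, Qinv)
J_state = cost_state_form(xs, xb, Binv, ys, Rinv, Qinv)
print("control vector sizes:", n + N * n, "and", (N + 1) * n)
print("costs agree:", np.isclose(J_error, J_state))
```

Both control vectors have $(N+1)n$ entries in this sketch; the formulations differ in how the model enters, as a forward recursion in the error form and as a penalty on consecutive states in the state form.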

The inclusion of the model errors at each observation time increases the size of 
the argument of J by a factor of N + 1, the number of observation times. One way 
to reduce this cost is by assuming a relationship in time between the model errors 
$\eta_i$. Theoretical work by [46] used an augmented state approach to solve for the state
and the model error, with a dynamical equation used to explain the evolution of the 
error. The authors introduced a general form for the error evolution, including both 
a systematic and random component of the error. Various options for the systematic 
evolution were proposed, including a constant bias error and simple dynamical evolu- 
tions, and the methods were illustrated on simple systems. In the context of a regional 
atmospheric model, [119] demonstrated a weak-constraint 4DVar system under the as- 
sumption that the model error was serially correlated and obeyed a first-order Markov 
process. 

Since this early work, there have been several idealized studies with weak- 
constraint 4DVar, but the move towards operational implementations in large scale 
systems has been slow. One of the biggest remaining challenges is the specification of 
the model error covariance matrix $Q_i$ for real systems. An initial idea was to take this
matrix to be a scalar multiple of the background error covariance matrix B. However, 
in experiments with the ECMWF atmospheric forecasting system using formulation 
(2.58), [110] showed that this choice implies that corrections to the model error lie 
in the same space as those to the background. This leads to estimates of model er- 
ror that are very similar to the increments to the initial conditions. An alternative 
method, proposed in the same paper, is based on the use of model tendency fields, 
that is, fields of the change in model variables over a model time step. The statistics
of $Q_i$ are estimated from an ensemble of differences between model tendency
fields using the NMC method in a similar way that differences between the model 
fields themselves are used in the estimation of the background error covariances (as 
explained in Section 3.2). [110] interprets differences between these tendencies as 
a proxy for the uncertainty in the model forcing. The statistics from this sample are 
then fit to the same statistical model as is used for the matrix B. The use of a covariance
matrix estimated in this way was tested in weak-constraint 4DVar experiments
that assumed a constant error over the assimilation window. This was shown to give
an improvement over the use of a covariance matrix defined by a scalar multiple of B.

The work of [80] illustrated the implementation of weak-constraint 4DVar using 
such a matrix, again in the ECMWF system, to estimate a constant bias error in the 
stratosphere where the model is known to have biases. A similar scheme has been 
introduced into the operational assimilation system of ECMWF [40]. In this implementation,
the deviation of the error from its mean value is minimized, and thus the
last term of (2.58) becomes

$$
\frac{1}{2} (\eta - \bar{\eta})^T Q^{-1} (\eta - \bar{\eta}), \tag{2.60}
$$

where $\bar{\eta}$ is the estimate of the model bias from the previous analysis cycle. In this way,
the assimilation ensures that the estimated error does not vary too quickly from one 
analysis cycle to the next. 

Despite these initial successes, much more work is needed. One particular dif- 
ficulty is that it is not clear how to differentiate between model bias and observa- 
tion bias since the assimilation only measures the difference between the model and 
the observations. [110] showed a case study of observation bias being interpreted as 
a model error by weak-constraint 4DVar. This problem was discussed further by [78] 
in the context of ocean data assimilation. They suggested that to estimate both model 
and observation bias, it is necessary to include information on the spatial and tempo- 
ral structure of these biases in the covariance matrices. 

In order to then move away from the assumption of a constant bias and treat 
time-varying systematic and random model errors, more sophisticated methods for 
describing the evolution of errors must be developed. This evolution is likely to be 
dependent on the specific model being used, yet general methods for representing 
this are also needed. At the same time, efficient and accurate representations of the 
covariances of these model errors must be found. The use of the weak-constraint for- 
mulation of 4DVar holds much promise to counteract the inadequacies of models, but 
many challenges remain open to be able to implement this in very large environmen- 
tal models. 


4 Summary and future perspectives 


Variational data assimilation is now a well-established method for combining obser- 
vational data with very large environmental models. However, as illustrated in this 
article, its successful implementation requires careful and judicious choices in each 
aspect of the assimilation scheme. In some cases, these choices are determined by 
the physical system being modeled or the observational data available, for example, 
the specification of the error covariances in the system. In other cases, the choices 
may be determined by the size of the problem and the need to solve it in an efficient 
manner, often for real-time forecasting, or by features of the numerical model itself,
such as lateral boundary conditions. In each instance, the choices to be made will inevitably
be a compromise between the ideal solution and what is practically feasible
in a given system. We have presented some of the solutions that have been found that
have allowed variational data assimilation to be implemented in large environmental
forecasting systems. Nevertheless, much research continues to improve on these
solutions so as to find better estimates of the state and so produce better forecasts.

One particularly active area in numerical weather prediction is the desire to use 
more information from ensembles of forecasts to provide time-varying covariances 
for the background errors, combining the advantages of ensemble filtering methods 
with the advantages of 4DVar. ECMWF have implemented a system in which an en- 
semble of 4DVar assimilations are run and the statistics from this ensemble are used 
to update the variances of the background errors [15]. Extensions to this method to 
also calculate the covariance information are being sought. An alternative approach 
is to use information from ensembles of forecasts to calculate covariance information 
throughout the whole assimilation window. This method was proposed by [81] and 
tested in a global weather prediction model by [19, 20]. An advantage of this method 
is that the tangent linear and adjoint models are not required in the 4DVar since all 
the evolution information comes through the ensemble of nonlinear model forecasts. 
Hence, this makes development of the system much easier. 

Besides the many great challenges that we have discussed in this article, new 
challenges are arising for the future evolution of variational data assimilation 
systems. The advent of massively parallel computers means that the algorithms 
used currently to solve the assimilation problem may no longer be efficient on fu- 
ture computer architectures. Hence, work is needed to develop new algorithms to 
solve the problem, particularly with respect to efficient minimization and precon- 
ditioning methods. This may be easier as systems move to a weak-constraint form 
of 4DVar but, as discussed above, that introduces its own difficulties [40]. Another 
challenge comes from the move towards more integrated Earth-system models, with 
different environmental models coupled to each other. For example, for seasonal to 
decadal prediction, it is now common to use coupled atmosphere-ocean models, but 
the initialization of these models with data assimilation is still in its infancy. Partic- 
ular problems arise from the very different time scales in the atmosphere and ocean 
system and from the model biases in atmosphere and ocean models. Some work 
has been done to implement 4DVar in such systems in order to estimate the ocean 
state and coupling parameters [89, 106], but the estimation of the complete state in 
coupled atmosphere-ocean models remains an open problem for the coming years.


Abstract: This survey paper is written with the intention of giving a mathematical
introduction to filtering techniques for intermittent data assimilation, and to survey 
some recent advances in the field. The paper is divided into three parts. The first part 
introduces Bayesian statistics and its application to statistical inference and estima- 
tion. Basic aspects of Markov processes, as they typically arise from scientific models 
in the form of stochastic differential and/or difference equations, are covered in the 
second part. The third and final part describes the filtering approach to estimation of 
model states by assimilation of observational data into scientific models. While most 
of the material is of survey type, very recent advances in the field of nonlinear data 
assimilation covered in this paper include a discussion of Bayesian inference in the 
context of optimal transportation and coupling of random variables, as well as a dis- 
cussion of recent advances in ensemble transform filters. References and sources for 
further reading material will be listed at the end of each section.


1 Bayesian statistics


In this section, we summarize the Bayesian approach to statistical inference and esti- 
mation in which probability is interpreted as a measure of uncertainty (of the system 
state, for example). Contrary to closely related inverse problem formulations, all vari- 
ables involved are considered to be uncertain and are described as random variables. 
Furthermore, uncertainty is only discussed in the context of available information, 
requiring the computation of conditional probabilities; Bayes’ formula is used for sta- 
tistical inference. We start with a short introduction to random variables. 


The second author would like to acknowledge support from NERC grant NE/102013X/1.


1.1 Preliminaries


We start with a sample space $\Omega$ which characterizes all possible outcomes of an experiment.
An event is a subset of $\Omega$ and we assume that the set $\mathcal{F}$ of all events forms a
$\sigma$-algebra (i.e. $\mathcal{F}$ is nonempty and closed under complementation and countable unions).
For example, suppose that $\Omega = \mathbb{R}$. Then, events can be defined by taking all possible
countable unions and complements of intervals $(a, b] \subset \mathbb{R}$; these are known as the
Borel sets.


Definition 3.1 (Probability measure). A probability measure is a function
$P : \mathcal{F} \to [0,1]$ with the following properties:

(i) Total probability equals one: $P(\Omega) = 1$.

(ii) Probability is additive for disjoint events: If $A_1, A_2, \ldots, A_n, \ldots$ is a finite or
countable collection of events $A_i \in \mathcal{F}$ and $A_i \cap A_j = \emptyset$ for $i \neq j$, then

$$
P\left( \bigcup_i A_i \right) = \sum_i P(A_i).
$$

The triple $(\Omega, \mathcal{F}, P)$ is called a probability space.


Definition 3.2 (Random variable). A function $X : \Omega \to \mathbb{R}$ is called a (univariate) random
variable if

$$
\{\omega \in \Omega : X(\omega) \le x\} \in \mathcal{F}
$$

for all $x \in \mathbb{R}$. The (cumulative) probability distribution function of X is given by

$$
F_X(x) = P(\{\omega \in \Omega : X(\omega) \le x\}).
$$

The cumulative probability distribution function implies a probability measure on $\mathbb{R}$
which we denote by $\mu_X$.


Often, when working with a random variable X, the underlying probability space
$(\Omega, \mathcal{F}, P)$ is not emphasized; one typically only specifies the target space $\mathcal{X} = \mathbb{R}$ and
the probability distribution or measure $\mu_X$ on $\mathcal{X}$. We then say that $\mu_X$ is the law of X
and write $X \sim \mu_X$. A probability measure $\mu_X$ introduces an integral over $\mathcal{X}$ and

$$
E_X[f] = \int_{\mathcal{X}} f(x) \, \mu_X(dx)
$$

is called the expectation value of a function $f : \mathbb{R} \to \mathbb{R}$ (f is called a measurable
function where the integral exists). We also use the notation $\mathrm{law}(X) = \mu_X$ to indicate
that $\mu_X$ is the probability measure for a random variable X. Two important choices
for f are $f(x) = x$, which leads to the mean $\bar{x} = E_X[x]$ of X, and $f(x) = (x - \bar{x})^2$,
which leads to the variance $\sigma^2 = E_X[(x - \bar{x})^2]$ of X.

Univariate random variables naturally extend to the multivariate case, i.e.
$\mathcal{X} = \mathbb{R}^N$, $N > 1$. A probability measure $\mu_X$ on $\mathcal{X}$ is called absolutely continuous
(with respect to the standard Lebesgue measure dx on $\mathbb{R}^N$) if there exists a probability
density function (PDF) $\pi_X : \mathcal{X} \to \mathbb{R}$ with $\pi_X(x) \ge 0$, and

$$
E_X[f] = \int_{\mathcal{X}} f(x) \, \mu_X(dx) = \int_{\mathbb{R}^N} f(x) \, \pi_X(x) \, dx
$$

for all measurable functions f. The shorthand $\mu_X(dx) = \pi_X dx$ is often adopted.
The implication is that one can, for all practical purposes, work within the classical
Riemann integral framework and does not need to resort to Lebesgue integration.
Again, we can define the mean $\bar{x} \in \mathbb{R}^N$ of a multivariate random variable and its
covariance matrix

$$
P = E_X\left[ (x - \bar{x})(x - \bar{x})^T \right] \in \mathbb{R}^{N \times N}.
$$

Here, $a^T$ denotes the transpose of a vector a. We now discuss a few standard distributions.


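Example 3.3 (Gaussian distribution). We use the notation $X \sim N(\bar{x}, \sigma^2)$ to denote
a univariate Gaussian random variable with mean $\bar{x}$ and variance $\sigma^2$, with PDF given
by

$$ \pi_X(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{1}{2\sigma^2}(x - \bar{x})^2}, \qquad x \in \mathbb{R}. $$

In the multivariate case, we use the notation $X \sim N(\bar{x}, \Sigma)$ to denote a Gaussian
random variable with PDF given by

$$ \pi_X(x) = \frac{1}{(2\pi)^{N/2} |\Sigma|^{1/2}} \exp\Big(-\frac{1}{2}(x - \bar{x})^T \Sigma^{-1} (x - \bar{x})\Big), \qquad x \in \mathbb{R}^N. $$
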
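Example 3.4 (Laplace distribution and Gaussian mixtures). The univariate Laplace
distribution has PDF

$$ \pi_X(x) = \frac{\lambda}{2}\, e^{-\lambda |x|}, \qquad x \in \mathbb{R}. $$

This may be rewritten as

$$ \pi_X(x) = \int_0^\infty \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-x^2/(2\sigma^2)}\, \frac{\lambda^2}{2}\, e^{-\lambda^2 \sigma^2/2}\, d\sigma^2, $$

which is a weighted Gaussian PDF with mean zero and variance $\sigma^2$ integrated over
$\sigma^2$. By replacing the integral by a Riemann sum over a sequence of quadrature points
$\{\sigma_j^2\}_{j=1}^J$, we obtain

$$ \pi_X(x) \approx \sum_{j=1}^J \alpha_j\, \frac{1}{\sqrt{2\pi}\,\sigma_j}\, e^{-x^2/(2\sigma_j^2)}, \qquad \alpha_j \propto \frac{\lambda^2}{2}\, e^{-\lambda^2 \sigma_j^2/2}\, (\sigma_j^2 - \sigma_{j-1}^2), $$

and the constant of proportionality is chosen such that the weights $\alpha_j$ sum to one.
This is an example of a Gaussian mixture distribution, namely, a weighted sum of
Gaussians. In this case, the Gaussians are all centered on $x = 0$; the most general
form of a Gaussian mixture is

$$ \pi_X(x) = \sum_{j=1}^J \alpha_j\, \frac{1}{\sqrt{2\pi}\,\sigma_j}\, e^{-(x - \bar{x}_j)^2/(2\sigma_j^2)}, $$

with weights $\alpha_j > 0$ subject to $\sum_{j=1}^J \alpha_j = 1$, and locations
$-\infty < \bar{x}_j < \infty$. Univariate Gaussian mixtures generalize to mixtures of
multivariate Gaussians in the obvious manner.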
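
The scale-mixture representation of the Laplace PDF in Example 3.4 can be checked numerically. The following is a minimal sketch, not part of the original text: it assumes a midpoint rule on a uniform grid of variances, and the names `gaussian_mixture_laplace`, `s_max` and `J` are illustrative choices.

```python
import numpy as np

def laplace_pdf(x, lam):
    """Laplace PDF (lam / 2) * exp(-lam * |x|)."""
    return 0.5 * lam * np.exp(-lam * np.abs(x))

def gaussian_mixture_laplace(x, lam, s_max=40.0, J=50_000):
    """Approximate the Laplace PDF by a J-component zero-mean Gaussian mixture.

    The variances s_j are midpoints of a uniform grid on (0, s_max) (an
    assumed quadrature rule); the weights are proportional to
    (lam^2 / 2) * exp(-lam^2 * s_j / 2) * ds and normalized to sum to one.
    """
    ds = s_max / J
    s = (np.arange(J) + 0.5) * ds                  # quadrature points (variances)
    alpha = 0.5 * lam**2 * np.exp(-0.5 * lam**2 * s) * ds
    alpha /= alpha.sum()                           # weights sum to one
    x = np.atleast_1d(x)[:, None]
    gauss = np.exp(-x**2 / (2.0 * s)) / np.sqrt(2.0 * np.pi * s)
    return gauss @ alpha                           # weighted sum of Gaussians

x = np.array([0.5, 1.0, 2.0])
err = np.max(np.abs(gaussian_mixture_laplace(x, 1.5) - laplace_pdf(x, 1.5)))
```

Points away from $x = 0$ are used because the integrand has an integrable singularity in the variance at $x = 0$, which a plain midpoint rule resolves only slowly.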


Example 3.5 (Point distribution). As a final example, we consider the point measure
$\mu_{x_0}$ defined by

$$ \int_{\mathcal{X}} f(x)\, \mu_{x_0}(dx) = f(x_0). $$

Using the Dirac delta notation $\delta(\cdot)$, this can be formally written as
$\mu_{x_0}(dx) = \delta(x - x_0)\, dx$. The associated random variable $X$ has the certain outcome
$X(\omega) = x_0$ for almost all $\omega \in \Omega$. One can call such a random variable
deterministic, and write $X = x_0$ for short. Note that the point measure is not absolutely
continuous with respect to the Lebesgue measure, i.e. there is no corresponding probability
density function.


We now briefly discuss pairs of random variables $X_1$ and $X_2$ over the same target
space $\mathcal{X}$. Formally, we can treat them as a single random variable $Z = (X_1, X_2)$ over
$\mathcal{Z} = \mathcal{X} \times \mathcal{X}$ with a joint distribution $\mu_{X_1 X_2}(x_1, x_2) = \mu_Z(z)$.


Definition 3.6 (Marginals, independence, conditional probability distributions). Let $X_1$
and $X_2$ denote two random variables on $\mathcal{X}$ with joint PDF $\pi_{X_1 X_2}(x_1, x_2)$. The two
PDFs

$$ \pi_{X_1}(x_1) = \int_{\mathcal{X}} \pi_{X_1 X_2}(x_1, x_2)\, dx_2 $$

and

$$ \pi_{X_2}(x_2) = \int_{\mathcal{X}} \pi_{X_1 X_2}(x_1, x_2)\, dx_1, $$

respectively, are called the marginal PDFs, i.e. $X_1 \sim \pi_{X_1}$ and $X_2 \sim \pi_{X_2}$. The two
random variables are called independent if

$$ \pi_{X_1 X_2}(x_1, x_2) = \pi_{X_1}(x_1)\, \pi_{X_2}(x_2). $$

We also introduce the conditional PDFs

$$ \pi_{X_1}(x_1 | x_2) = \frac{\pi_{X_1 X_2}(x_1, x_2)}{\pi_{X_2}(x_2)} $$

and

$$ \pi_{X_2}(x_2 | x_1) = \frac{\pi_{X_1 X_2}(x_1, x_2)}{\pi_{X_1}(x_1)}. $$


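Example 3.7 (Gaussian joint distributions). A Gaussian joint distribution $\pi_{XY}(x, y)$,
$x, y \in \mathbb{R}$, with mean $(\bar{x}, \bar{y})$ and covariance matrix

$$ \Sigma = \begin{pmatrix} \sigma_{xx}^2 & \sigma_{xy}^2 \\ \sigma_{yx}^2 & \sigma_{yy}^2 \end{pmatrix} $$

leads to a Gaussian conditional distribution

$$ \pi_X(x | y) = \frac{1}{\sqrt{2\pi}\,\sigma_c}\, e^{-\frac{1}{2\sigma_c^2}(x - \bar{x}_c)^2}, \tag{3.1} $$

with conditional mean

$$ \bar{x}_c = \bar{x} + \sigma_{xy}^2\, \sigma_{yy}^{-2}\, (y - \bar{y}) $$

and conditional variance

$$ \sigma_c^2 = \sigma_{xx}^2 - \sigma_{xy}^2\, \sigma_{yy}^{-2}\, \sigma_{yx}^2. $$

For a given $y$, we define $X|y$ as the random variable with conditional probability
distribution $\pi_X(x|y)$ and write $X|y \sim N(\bar{x}_c, \sigma_c^2)$.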
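
As a small worked instance of Example 3.7, the conditional mean and variance can be evaluated directly from the entries of the covariance matrix. This is an illustrative sketch; the function name `condition_gaussian` and the numerical values are assumptions made for the example.

```python
import numpy as np

def condition_gaussian(mean, cov, y):
    """Condition a bivariate Gaussian in (x, y) on an observed value y.

    mean = (x_bar, y_bar); cov = [[s_xx, s_xy], [s_yx, s_yy]], where the
    entries play the role of the sigma^2 entries in Example 3.7.
    Returns the conditional mean and the conditional variance of X given y.
    """
    x_bar, y_bar = mean
    s_xx, s_xy = cov[0]
    s_yx, s_yy = cov[1]
    mean_c = x_bar + s_xy / s_yy * (y - y_bar)
    var_c = s_xx - s_xy / s_yy * s_yx
    return mean_c, var_c

cov = np.array([[2.0, 1.0], [1.0, 2.0]])
mean_c, var_c = condition_gaussian((0.0, 0.0), cov, y=1.0)
```

Conditioning never increases the variance here: `var_c` is bounded above by the prior variance `s_xx`, with equality only for uncorrelated variables.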


1.2 Bayesian inference 


We start this section by considering transformations of random variables. A typical
scenario is the following one. Given a pair of independent random variables $\Xi$ with
values in $\mathcal{Y} = \mathbb{R}^K$ and $X$ with values in $\mathcal{X} = \mathbb{R}^N$, together with a
continuous map $h : \mathbb{R}^N \to \mathbb{R}^K$, we define a new random variable

$$ Y = h(X) + \Xi. \tag{3.2} $$

The map $h$ is called the observation operator, yielding observed quantities given a
particular value $x$ of the state variable $X$, and $\Xi$ represents measurement errors.


Theorem 3.8 (PDF for transformed random variable). Assume that both $X$ and $\Xi$ are
absolutely continuous. Then $Y$ is absolutely continuous with PDF

$$ \pi_Y(y) = \int_{\mathcal{X}} \pi_\Xi(y - h(x))\, \pi_X(x)\, dx. \tag{3.3} $$

If $X$ is a deterministic variable, i.e. $X = x_0$ for an appropriate $x_0 \in \mathbb{R}^N$, then the PDF
simplifies to

$$ \pi_Y(y) = \pi_\Xi(y - h(x_0)). $$

Proof. We start with $X = x_0$. Then, $Y - h(x_0) = \Xi$, which immediately implies the
stated result. In the general case, consider the conditional probability

$$ \pi_Y(y | x_0) = \pi_\Xi(y - h(x_0)). $$

Equation (3.3) then follows from the implied joint distribution

$$ \pi_{XY}(x, y) = \pi_Y(y | x)\, \pi_X(x) $$

and subsequent marginalization, i.e.

$$ \pi_Y(y) = \int_{\mathcal{X}} \pi_{XY}(x, y)\, dx = \int_{\mathcal{X}} \pi_Y(y | x)\, \pi_X(x)\, dx. $$


The problem of predicting the distribution $\pi_Y$ of $Y$ given a particular configuration
of the state variable $X = x_0$ is called the forward problem. The problem of
predicting the distribution of the state variable $X$ given an observation $Y = y_0$ gives
rise to an inference problem, which is defined more formally as follows.


Definition 3.9 (Bayesian inference). Given a particular value $y_0 \in \mathbb{R}^K$, we consider
the associated conditional PDF $\pi_X(x | y_0)$ for the random variable $X$. From

$$ \pi_{XY}(x, y) = \pi_Y(y | x)\, \pi_X(x) = \pi_X(x | y)\, \pi_Y(y), $$

we obtain Bayes’ formula

$$ \pi_X(x | y_0) = \frac{\pi_Y(y_0 | x)\, \pi_X(x)}{\pi_Y(y_0)}. \tag{3.4} $$

The object of Bayesian inference is to obtain $\pi_X(x | y_0)$.


Since $\pi_Y(y_0) \neq 0$ is a constant, equation (3.4) can be written as

$$ \pi_X(x | y_0) \propto \pi_Y(y_0 | x)\, \pi_X(x) = \pi_\Xi(y_0 - h(x))\, \pi_X(x), $$

where the constant of proportionality only depends on $y_0$. We denote by $\pi_X(x)$ the
prior PDF of the random variable $X$ and by $\pi_X(x | y_0)$ the posterior PDF. The function
$\pi_Y(y_0 | x)$ is called the likelihood function.

Having obtained a posterior PDF $\pi_X(x | y_0)$, it is often necessary to provide an
estimate of a “most likely” value of $x$ conditioned on $y_0$. Bayesian estimators for $x$ are
defined as follows.


Definition 3.10 (Bayesian estimators). Given a posterior PDF $\pi_X(x | y_0)$, we define
a Bayesian estimator $\hat{x} \in \mathcal{X}$ by

$$ \hat{x} = \arg\min_{x' \in \mathcal{X}} \int_{\mathcal{X}} L(x', x)\, \pi_X(x | y_0)\, dx, $$

where $L(x', x)$ is an appropriate loss function. Popular choices include the maximum
a posteriori (MAP) estimator, with $\hat{x}$ corresponding to the modal value of $\pi_X(x | y_0)$.
The MAP estimator formally corresponds to the loss function $L(x', x) = 1_{\{x' \neq x\}}$. The
posterior median estimator corresponds to $L(x', x) = \|x' - x\|$, while the minimum
mean square error estimator (or conditional mean estimator)

$$ \hat{x} = \int_{\mathcal{X}} x\, \pi_X(x | y_0)\, dx $$

results from $L(x', x) = \|x' - x\|^2$.
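
The three estimators of Definition 3.10 differ for a skewed posterior. The sketch below evaluates them on a grid for a hypothetical univariate posterior proportional to $x e^{-x}$ on $x \ge 0$; the density, the grid and the variable names are assumptions chosen only for illustration.

```python
import numpy as np

# Hypothetical skewed posterior on a grid: pi(x | y0) proportional to x * exp(-x).
x = np.linspace(0.0, 40.0, 400_001)
dx = x[1] - x[0]
post = x * np.exp(-x)
post /= post.sum() * dx                      # normalize so the density integrates to one

x_map = x[np.argmax(post)]                   # MAP estimator: mode of the posterior
cdf = np.cumsum(post) * dx
x_median = x[np.searchsorted(cdf, 0.5)]      # posterior median: first grid point with CDF >= 1/2
x_mean = np.sum(x * post) * dx               # conditional mean (minimum mean square error)
```

For this density the mode, median and mean are ordered as `x_map < x_median < x_mean`, in contrast to the Gaussian case, where all three coincide.
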
We now consider an important example for which the posterior can be computed 
analytically. 

Example 3.11 (Bayes’ formula for Gaussian distributions). Consider the case of a scalar
observation, i.e. $K = 1$, with $\Xi \sim N(0, \sigma_{rr}^2)$. Then,

$$ \pi_\Xi(h(x) - y) = \frac{1}{\sqrt{2\pi}\,\sigma_{rr}}\, e^{-\frac{1}{2\sigma_{rr}^2}(h(x) - y)^2}. $$

We also assume that $X \sim N(\bar{x}, P)$ and that $h(x) = Hx$. Then, the posterior
distribution of $X$ given $y = y_0$ is also Gaussian with mean

$$ \bar{x}_c = \bar{x} - P H^T \big(H P H^T + \sigma_{rr}^2\big)^{-1} (H\bar{x} - y_0) $$

and covariance matrix

$$ P_c = P - P H^T \big(H P H^T + \sigma_{rr}^2\big)^{-1} H P. $$

These are the famous Kalman update formulas, which follow from the fact that the
product of two Gaussian distributions is also Gaussian, where the variance of
$Y = HX + \Xi$ is given by

$$ \sigma_{yy}^2 = H P H^T + \sigma_{rr}^2 $$

and the vector of covariances between $x \in \mathbb{R}^N$ and $y = Hx \in \mathbb{R}$ is given by $P H^T$. For
Gaussian random variables, the MAP, posterior median, and minimum mean square
error estimators coincide and are given by $\bar{x}_c$. The case of vector-valued observations
will be discussed in Section 3.3. Finally, note that $\bar{x}_c$ solves the minimization problem

$$ \bar{x}_c = \arg\min_{x \in \mathbb{R}^N} \Big\{ \frac{1}{2}(x - \bar{x})^T P^{-1} (x - \bar{x}) + \frac{1}{2\sigma_{rr}^2}(Hx - y_0)^2 \Big\}, $$

which can be viewed as a regularization of the ill-posed inverse problem

$$ y_0 = Hx, \qquad x \in \mathbb{R}^N, \quad N > 1, $$

in the sense of Tikhonov. A standard Tikhonov regularization would be based on
$P^{-1} = \delta I$ with the regularization parameter $\delta > 0$ appropriately chosen. In the
Bayesian approach to inverse problems, the regularization term is instead determined
by the Gaussian prior $\pi_X$.

We mention in passing that Bayes’ formula has to be replaced by the Radon-Nikodym
derivative in the case where the prior distribution is not absolutely continuous
with respect to the Lebesgue measure (or in case the space $\mathcal{X}$ does not admit
a Lebesgue measure). Consider, as an example, the case of an empirical measure $\mu_X$
centered about the $M$ samples $x_i \in \mathcal{X}$, $i = 1, \ldots, M$, i.e. a weighted sum of point
measures given by

$$ \mu_X(dx) = \frac{1}{M} \sum_{i=1}^M \mu_{x_i}(dx). $$

Then, the resulting posterior measure $\mu_X(\cdot | y_0)$ is absolutely continuous with respect
to $\mu_X$, i.e. there exists a Radon-Nikodym derivative such that

$$ \int_{\mathcal{X}} f(x)\, \mu_X(dx | y_0) = \int_{\mathcal{X}} f(x)\, \frac{d\mu_X(x | y_0)}{d\mu_X(x)}\, \mu_X(dx), $$

and the Radon-Nikodym derivative satisfies

$$ \frac{d\mu_X(x | y_0)}{d\mu_X(x)} \propto \pi_\Xi(h(x) - y_0). $$

Furthermore, the explicit expression for the posterior measure is given by

$$ \mu_X(dx | y_0) = \sum_{i=1}^M w_i\, \mu_{x_i}(dx), $$

with weights $w_i \ge 0$ defined by

$$ w_i \propto \pi_\Xi(h(x_i) - y_0), $$

and the constant of proportionality is determined by the condition $\sum_{i=1}^M w_i = 1$.
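
The Kalman update formulas of Example 3.11 and the reweighting of an empirical measure described above can be compared numerically. The following sketch assumes a linear scalar observation with hypothetical values for the prior mean, the prior covariance, the observation row vector `H` and the error variance; the function name `kalman_update` is illustrative.

```python
import numpy as np

def kalman_update(x_bar, P, H, sigma_rr2, y0):
    """Posterior mean and covariance for X ~ N(x_bar, P) and y0 = H x + noise.

    H is stored as a 1-D array (scalar observation); sigma_rr2 is the
    variance of the measurement error.
    """
    PHt = P @ H                                # covariances between x and Hx
    s_yy = H @ PHt + sigma_rr2                 # variance of Y = H X + Xi
    x_c = x_bar - PHt * (H @ x_bar - y0) / s_yy
    P_c = P - np.outer(PHt, PHt) / s_yy
    return x_c, P_c

rng = np.random.default_rng(0)
x_bar = np.array([1.0, -1.0])
P = np.array([[2.0, 0.5], [0.5, 1.0]])
H = np.array([1.0, 0.0])
sigma_rr2, y0 = 0.5, 2.0

x_c, P_c = kalman_update(x_bar, P, H, sigma_rr2, y0)

# Empirical prior measure: M samples, each carrying the weight 1/M.
M = 200_000
samples = rng.multivariate_normal(x_bar, P, size=M)
log_w = -0.5 * (samples @ H - y0) ** 2 / sigma_rr2   # log-likelihood up to a constant
w = np.exp(log_w - log_w.max())                      # subtract the maximum for stability
w /= w.sum()                                         # weights sum to one
x_particles = w @ samples                            # mean of the weighted posterior measure
```

The weighted particle mean `x_particles` approximates the analytic posterior mean `x_c` up to a Monte Carlo error that shrinks as `M` grows.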


1.3 Coupling of random variables 


We have seen that under Bayes’ formula, a prior probability measure $\mu_X(\cdot)$ on $\mathcal{X}$ is
transformed into a posterior probability measure $\mu_X(\cdot | y_0)$ on $\mathcal{X}$ conditioned on the
observation $y_0 = Y(\omega)$. With each of the probability measures, we can associate
random variables such that, e.g. $X_1 \sim \mu_X$ and $X_2 \sim \mu_X(\cdot | y_0)$. However, while Bayes’
formula leads to a transformation of measures, it does not imply a specific
transformation on the level of the associated random variables; many different
transformations of random variables lead to the same probability measure. In this section, we
will, therefore, introduce the concept of coupling two probability measures.


Definition 3.12 (Coupling). Let $\mu_{X_1}$ and $\mu_{X_2}$ denote two probability measures on
a space $\mathcal{X}$. A coupling of $\mu_{X_1}$ and $\mu_{X_2}$ consists of a pair $Z = (X_1, X_2)$ of random
variables such that $X_1 \sim \mu_{X_1}$, $X_2 \sim \mu_{X_2}$, and $Z \sim \mu_Z$. The joint measure $\mu_Z$ on the
product space $\mathcal{Z} = \mathcal{X} \times \mathcal{X}$ is called the transference plan for this coupling. The set of
all transference plans is denoted by $\Pi(\mu_{X_1}, \mu_{X_2})$.


Here, we will discuss different forms of couplings, assuming that both the source
and target distributions are explicitly known, whilst applications to Bayes’ formula
(3.4) will be discussed in Sections 1.4 and 3. In practice, the source distribution
often needs to be estimated from available realizations of the underlying random
variable $X_1$. This is the subject of parametric and nonparametric statistics and will not
be discussed in this survey paper. In the context of Bayesian statistics, knowledge
of the source (prior) distribution and the likelihood implies knowledge of the target
(posterior) distribution.

Since prior distributions in Bayesian inference are generally assumed to be
absolutely continuous, the discussion of couplings will be restricted to the less abstract
case of $\mathcal{X} = \mathbb{R}^N$ and $\mu_{X_1}(dx) = \pi_{X_1}(x)\, dx$, $\mu_{X_2}(dx) = \pi_{X_2}(x)\, dx$. In other words,
we assume that the marginal measures are absolutely continuous. We cannot, however,
assume that the coupling is absolutely continuous on
$\mathcal{Z} = \mathcal{X} \times \mathcal{X} = \mathbb{R}^{2N}$. Clearly, couplings always exist since one can use the trivial
product coupling

$$ \pi_Z(x_1, x_2) = \pi_{X_1}(x_1)\, \pi_{X_2}(x_2), $$

in which case the associated random variables $X_1$ and $X_2$ are independent. The more
interesting case is that of a deterministic coupling.


Definition 3.13 (Deterministic coupling). Assume that we have a random variable $X_1$
with law $\mu_{X_1}$ and a second probability measure $\mu_{X_2}$. A diffeomorphism
$T : \mathcal{X} \to \mathcal{X}$ is called a transport map if the induced random variable $X_2 = T(X_1)$ satisfies

$$ \int_{\mathcal{X}} f(x_2)\, \mu_{X_2}(dx_2) = \int_{\mathcal{X}} f(T(x_1))\, \mu_{X_1}(dx_1) $$

for all suitable functions $f : \mathcal{X} \to \mathbb{R}$. The associated coupling

$$ \mu_Z(dx_1, dx_2) = \delta(x_2 - T(x_1))\, \mu_{X_1}(dx_1)\, dx_2, $$

where $\delta(\cdot)$ is the standard Dirac distribution, is called a deterministic coupling. Note
that $\mu_Z$ is not absolutely continuous, even if both $\mu_{X_1}$ and $\mu_{X_2}$ are.

Using

$$ \int_{\mathcal{X}} f(x_2)\, \delta(x_2 - T(x_1))\, dx_2 = f(T(x_1)), $$

it indeed follows from the above definition of $\mu_Z$ that

$$ \int_{\mathcal{X}} f(x_2)\, \mu_{X_2}(dx_2) = \int_{\mathcal{Z}} f(x_2)\, \mu_Z(dx_1, dx_2) = \int_{\mathcal{X}} f(T(x_1))\, \mu_{X_1}(dx_1). $$

We discuss a simple example.

Example 3.14 (One-dimensional transport map). Let $\pi_{X_1}(x) > 0$ and $\pi_{X_2}(x) > 0$
denote two PDFs on $\mathcal{X} = \mathbb{R}$. We define the associated cumulative distribution functions
by

$$ F_{X_1}(x) = \int_{-\infty}^x \pi_{X_1}(x')\, dx', \qquad F_{X_2}(x) = \int_{-\infty}^x \pi_{X_2}(x')\, dx'. $$

Since $F_{X_2}$ is monotonically increasing, it has a unique inverse $F_{X_2}^{-1}(p)$ for $p \in [0,1]$.
The inverse may be used to define a transport map that transforms $X_1$ into $X_2$ as
follows,

$$ X_2 = T(X_1) = F_{X_2}^{-1}\big(F_{X_1}(X_1)\big). $$

For example, consider the case where $X_1$ is a random variable with uniform
distribution $U([0,1])$, and $X_2$ is a random variable with standard normal distribution
$N(0, 1)$. Then, the transport map between $X_1$ and $X_2$ is simply the inverse of the
cumulative distribution function

$$ F_{X_2}(x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^x e^{-(x')^2/2}\, dx', $$

which provides a standard tool for converting uniformly distributed random numbers
to normally distributed ones.
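
The uniform-to-normal transport map of Example 3.14 can be sketched in a few lines. The example relies on `statistics.NormalDist.inv_cdf` from the Python standard library for the inverse cumulative distribution function; the sample size and the clipping tolerance are arbitrary choices.

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(1)
u = rng.random(100_000)                           # samples of X1 ~ U([0, 1])
u = np.clip(u, 1e-12, 1.0 - 1e-12)                # keep arguments strictly inside (0, 1)

inv_cdf = NormalDist(mu=0.0, sigma=1.0).inv_cdf   # inverse CDF of N(0, 1)
# F_{X1}(x) = x on [0, 1], so the transport map reduces to T = F_{X2}^{-1}.
x2 = np.array([inv_cdf(p) for p in u])
```

Because the map is monotonically increasing, it preserves the ordering of the samples: the smallest uniform sample is sent to the smallest normal sample.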


We now extend this transform method to random variables in $\mathbb{R}^N$ with $N = 2$.

Example 3.15 (Knothe-Rosenblatt rearrangement). Let $\pi_{X_1}(x^1, x^2)$ and $\pi_{X_2}(x^1, x^2)$
denote two PDFs on $x = (x^1, x^2) \in \mathbb{R}^2$. A transport map between $\pi_{X_1}$ and $\pi_{X_2}$
can be constructed in the following manner. We first find the two one-dimensional
marginals $\pi_{X_1^1}(x^1)$ and $\pi_{X_2^1}(x^1)$ of the two PDFs. In the previous example, we have
seen how to construct a transport map $X_2^1 = T_1(X_1^1)$ which couples these two
one-dimensional marginal PDFs. Here, $X_i^1$ denotes the first component of the random
variables $X_i$, $i = 1, 2$. Next, we write

$$ \pi_{X_1}(x^1, x^2) = \pi_{X_1^2}(x^2 | x^1)\, \pi_{X_1^1}(x^1), \qquad \pi_{X_2}(x^1, x^2) = \pi_{X_2^2}(x^2 | x^1)\, \pi_{X_2^1}(x^1), $$

and find a transport map $X_2^2 = T_2(X_1^1, X_1^2)$ by considering one-dimensional
couplings between $\pi_{X_1^2}(x^2 | x^1)$ and $\pi_{X_2^2}(x^2 | T_1(x^1))$ with $x^1$ fixed. The associated joint
distribution is given by

$$ \pi_Z(x_1^1, x_1^2, x_2^1, x_2^2) = \delta\big(x_2^1 - T_1(x_1^1)\big)\, \delta\big(x_2^2 - T_2(x_1^1, x_1^2)\big)\, \pi_{X_1}(x_1^1, x_1^2). $$

This is called the Knothe-Rosenblatt rearrangement, also well known to statisticians
under the name of conditional quantile transforms. It can be extended to $\mathbb{R}^N$,
$N \ge 3$, in the obvious way by introducing the conditional PDFs

$$ \pi_{X_1^3}(x^3 | x^1, x^2), \qquad \pi_{X_2^3}(x^3 | x^1, x^2), $$

and by constructing an appropriate map $X_2^3 = T_3(X_1^1, X_1^2, X_1^3)$ from those conditional
PDFs for fixed pairs $(x_1^1, x_1^2)$ and $(x_2^1, x_2^2) = (T_1(x_1^1), T_2(x_1^1, x_1^2))$, etc. While the
Knothe-Rosenblatt rearrangement can be used in quite general situations, it has the
undesirable property that the map depends on the choice of ordering of the variables,
i.e. in two dimensions a different map is obtained if one instead first couples the $x^2$
components.


Example 3.16 (Affine transport maps for Gaussian distributions). Consider two Gaussian
distributions $N(\bar{x}_1, \Sigma_1)$ and $N(\bar{x}_2, \Sigma_2)$ in $\mathbb{R}^N$ with means $\bar{x}_1$ and $\bar{x}_2$ and
covariance matrices $\Sigma_1$ and $\Sigma_2$, respectively. We first define the square root $\Sigma^{1/2}$ of a
symmetric positive definite matrix $\Sigma$ as the unique symmetric positive definite matrix
which satisfies $\Sigma^{1/2} \Sigma^{1/2} = \Sigma$. Then, the affine transformation

$$ x_2 = T(x_1) = \bar{x}_2 + \Sigma_2^{1/2} \Sigma_1^{-1/2} (x_1 - \bar{x}_1) \tag{3.5} $$

provides a deterministic coupling. Indeed, we find that

$$ (x_2 - \bar{x}_2)^T \Sigma_2^{-1} (x_2 - \bar{x}_2) = (x_1 - \bar{x}_1)^T \Sigma_1^{-1} (x_1 - \bar{x}_1) $$

under the suggested coupling. The proposed coupling is, of course, not unique since

$$ x_2 = T(x_1) = \bar{x}_2 + \Sigma_2^{1/2}\, Q\, \Sigma_1^{-1/2} (x_1 - \bar{x}_1), $$

where $Q$ is an orthogonal matrix, also provides a coupling. We will see in
Section 3.3 that a coupling between Gaussian random variables is also at the heart of the
ensemble square root filter formulations of sequential data assimilation.


Deterministic couplings can be viewed as a special case of a Markov process
$\{X_n\}_{n \in \{1,2\}}$ defined by

$$ \pi_{X_2}(x_2) = \int_{\mathcal{X}} \pi(x_2 | x_1)\, \pi_{X_1}(x_1)\, dx_1, $$

where $\pi(x_2 | x_1)$ denotes an appropriate conditional PDF for the random variable $X_2$
given $X_1 = x_1$. Indeed, we simply have

$$ \pi(x_2 | x_1) = \delta(x_2 - T(x_1)) $$

for deterministic couplings. We will come back to Markov processes in Section 2.

The trivial coupling $\pi_Z(x_1, x_2) = \pi_{X_1}(x_1)\, \pi_{X_2}(x_2)$ leads to a zero correlation
between the induced random variables $X_1$ and $X_2$ since their covariance is

$$ \mathrm{cov}(X_1, X_2) = E_Z\big[(x_1 - \bar{x}_1)(x_2 - \bar{x}_2)^T\big] = E_Z\big[x_1 x_2^T\big] - \bar{x}_1 \bar{x}_2^T = 0, $$

where $\bar{x}_i = E_{X_i}[x]$. A transport map leads instead to the covariance matrix

$$ \mathrm{cov}(X_1, X_2) = E_Z\big[x_1 x_2^T\big] - E_{X_1}[x_1]\, \big(E_{X_2}[x_2]\big)^T = E_{X_1}\big[x_1 T(x_1)^T\big] - \bar{x}_1 \bar{x}_2^T, $$

which is nonzero in general. If several transport maps exist, then one could choose
the one that maximizes the covariance. Now consider, for example, univariate
random variables $X_1$ and $X_2$. Maximizing their covariance for given marginal PDFs has
an important geometric interpretation: it is equivalent to minimizing the mean square
distance between $x_1$ and $T(x_1) = x_2$ given by

$$ \begin{aligned}
E_Z\big[|x_2 - x_1|^2\big] &= E_{X_1}\big[|x_1|^2\big] + E_{X_2}\big[|x_2|^2\big] - 2 E_Z[x_1 x_2] \\
&= E_{X_1}\big[|x_1|^2\big] + E_{X_2}\big[|x_2|^2\big] - 2 E_Z\big[(x_1 - \bar{x}_1)(x_2 - \bar{x}_2)\big] - 2 \bar{x}_1 \bar{x}_2 \\
&= E_{X_1}\big[|x_1|^2\big] + E_{X_2}\big[|x_2|^2\big] - 2 \bar{x}_1 \bar{x}_2 - 2\, \mathrm{cov}(X_1, X_2).
\end{aligned} $$

Hence, finding a joint measure $\mu_Z$ that minimizes the expectation of $(x_1 - x_2)^2$
simultaneously maximizes the covariance between $X_1$ and $X_2$. This geometric
interpretation leads to the celebrated Monge-Kantorovitch problem.


Definition 3.17 (Monge-Kantorovitch problem). A transference plan
$\mu_Z^* \in \Pi(\mu_{X_1}, \mu_{X_2})$ is called the solution to the Monge-Kantorovitch problem with
cost function $c(x_1, x_2) = \|x_1 - x_2\|^2$ if

$$ \mu_Z^* = \arg\inf_{\mu_Z \in \Pi(\mu_{X_1}, \mu_{X_2})} E_Z\big[\|x_1 - x_2\|^2\big]. \tag{3.6} $$

The associated function $W(\mu_{X_1}, \mu_{X_2})$, defined by

$$ W(\mu_{X_1}, \mu_{X_2})^2 = E_Z\big[\|x_1 - x_2\|^2\big], \qquad \mathrm{law}(Z) = \mu_Z^*, $$

is called the $L^2$-Wasserstein distance of $\mu_{X_1}$ and $\mu_{X_2}$.
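
For univariate measures, the link between covariance and mean square distance can be illustrated with samples. The sketch below is an illustration under two added assumptions that are not taken from the text: in one dimension the monotone (sorted) pairing of samples realizes the optimal coupling, and for two univariate Gaussians the squared $L^2$-Wasserstein distance equals the squared difference of means plus the squared difference of standard deviations.

```python
import numpy as np

rng = np.random.default_rng(2)
M = 100_000
x1 = rng.normal(0.0, 1.0, size=M)     # samples from N(0, 1)
x2 = rng.normal(2.0, 2.0, size=M)     # samples from N(2, 4), i.e. standard deviation 2

# Product coupling: pair the independent samples in the order they were drawn.
cost_product = np.mean((x1 - x2) ** 2)

# Monotone coupling: pair the i-th smallest x1 with the i-th smallest x2.
cost_sorted = np.mean((np.sort(x1) - np.sort(x2)) ** 2)

# Closed form for two univariate Gaussians (assumption stated in the lead-in).
w2_exact = (0.0 - 2.0) ** 2 + (1.0 - 2.0) ** 2
```

The independent pairing gives a mean square distance near 9, whereas the sorted pairing comes close to the closed-form value 5: maximizing the covariance between the paired samples minimizes the transport cost.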


Theorem 3.18 (Optimal transference plan). If the measures $\mu_{X_i}$, $i = 1, 2$, are
absolutely continuous, then the optimal transference plan that solves the Monge-Kantorovitch
problem corresponds to a deterministic coupling with transfer map

$$ X_2 = T(X_1) = \nabla_x \psi(X_1) $$

for some convex potential $\psi : \mathbb{R}^N \to \mathbb{R}$.


Proof. We only demonstrate that the solution to the Monge-Kantorovitch problem is 
of the desired form when the infimum in (3.6) is restricted to deterministic couplings. 
See [33, 53] for a complete proof based on approximative couplings using linear pro- 
gramming, the geometric concept of cyclical monotonicity of the support of an opti- 
mal coupling, and Rockafellar’s theorem. 

We denote the associated PDFs by $\pi_{X_i}$, $i = 1, 2$. We also introduce the inverse
transfer map $X_1 = S(X_2) = T^{-1}(X_2)$ and consider the functional

$$ \mathcal{L}[S, \Psi] = \frac{1}{2} \int_{\mathbb{R}^N} \|S(x) - x\|^2\, \pi_{X_2}(x)\, dx + \int_{\mathbb{R}^N} \big[\Psi(S(x))\, \pi_{X_2}(x) - \Psi(x)\, \pi_{X_1}(x)\big]\, dx $$

in $S$ and a potential $\Psi : \mathbb{R}^N \to \mathbb{R}$. We note that

$$ \int_{\mathbb{R}^N} \big[\Psi(S(x))\, \pi_{X_2}(x) - \Psi(x)\, \pi_{X_1}(x)\big]\, dx = \int_{\mathbb{R}^N} \Psi(x)\, \big[\pi_{X_2}(T(x))\, |DT(x)| - \pi_{X_1}(x)\big]\, dx $$

by a simple change of variables. Here, $|DT(x)|$ denotes the determinant of the Jacobian
matrix of $T$ at $x$, and the potential $\Psi$ can be interpreted as a Lagrange multiplier
enforcing the coupling of the two marginal PDFs under the desired transport map.
Taking variational derivatives with respect to $S$ and $\Psi$, we obtain

$$ \frac{\delta \mathcal{L}}{\delta S} = \pi_{X_2}(x)\, \big[(S(x) - x) + \nabla_x \Psi(S(x))\big] = 0 $$

and

$$ \frac{\delta \mathcal{L}}{\delta \Psi} = -\pi_{X_1}(x) + \pi_{X_2}(T(x))\, |DT(x)| = 0, \tag{3.7} $$

characterizing critical points of the functional $\mathcal{L}$. The first equality implies

$$ x_2 = x_1 + \nabla_x \Psi(x_1) = \nabla_x \Big(\frac{1}{2} x_1^T x_1 + \Psi(x_1)\Big) =: \nabla_x \psi(x_1) $$

and the second recovers our Ansatz that $T$ transforms $\pi_{X_1}$ into $\pi_{X_2}$ as a result of the
Lagrange multiplier $\Psi$.


Example 3.19 (Optimal transport maps for Gaussian distributions). Consider two Gaussian
distributions $N(\bar{x}_1, \Sigma_1)$ and $N(\bar{x}_2, \Sigma_2)$ in $\mathbb{R}^N$ with means $\bar{x}_1$ and $\bar{x}_2$, and
covariance matrices $\Sigma_1$ and $\Sigma_2$, respectively. We had previously discussed the deterministic
coupling (3.5). However, the induced affine transformation $x_2 = T(x_1)$ cannot
be generated from a potential $\psi$ since the matrix $\Sigma_2^{1/2} \Sigma_1^{-1/2}$ is, in general, not
symmetric. Indeed, the optimal coupling in the sense of Monge-Kantorovitch with cost function
$c(x_1, x_2) = \|x_1 - x_2\|^2$ is provided by

$$ x_2 = T(x_1) = \bar{x}_2 + \Sigma_2^{1/2} \big[\Sigma_2^{1/2} \Sigma_1 \Sigma_2^{1/2}\big]^{-1/2} \Sigma_2^{1/2} (x_1 - \bar{x}_1). \tag{3.8} $$

See [41] for a derivation. The following generalization will be used in Section 3.3.
Assume that a matrix $A \in \mathbb{R}^{N \times M}$ is given such that $\Sigma_2 = A A^T$. Clearly, we can choose
$A = \Sigma_2^{1/2}$, in which case $M = N$ and $A$ is symmetric. However, we allow for $A$ to be
nonsymmetric and $M$ can be different from $N$. An important observation is that one
can replace $\Sigma_2^{1/2}$ in (3.8) by $A$ and $A^T$, respectively, i.e.

$$ T(x_1) = \bar{x}_2 + A \big[A^T \Sigma_1 A\big]^{-1/2} A^T (x_1 - \bar{x}_1). \tag{3.9} $$

While optimal couplings are of broad theoretical and practical interest, their
computational implementation can be very demanding. In Section 3, we will discuss
an embedding method originally due to Jürgen Moser [38], which leads to a generally
nonoptimal, but computationally more tractable formulation in the context of
Bayesian statistics and data assimilation. Alternatively, we may replace the coupling
problem by an appropriate finite-dimensional linear programming problem [46].
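
The affine coupling (3.5) and the optimal coupling (3.8) can be compared numerically for a pair of covariance matrices. The sketch below uses hypothetical $2 \times 2$ matrices and a helper `sqrtm_spd` (an illustrative name) that builds the symmetric square root from an eigendecomposition. The cost comparison uses the identity $E\|x_1 - x_2\|^2 = \|\bar{x}_1 - \bar{x}_2\|^2 + \mathrm{tr}\,\Sigma_1 + \mathrm{tr}\,\Sigma_2 - 2\,\mathrm{tr}(B \Sigma_1)$ for a map with linear part $B$, with equal means assumed here.

```python
import numpy as np

def sqrtm_spd(S):
    """Symmetric positive definite square root via an eigendecomposition."""
    vals, vecs = np.linalg.eigh(S)
    return (vecs * np.sqrt(vals)) @ vecs.T

S1 = np.array([[2.0, 0.6], [0.6, 1.0]])      # hypothetical Sigma_1
S2 = np.array([[1.0, -0.3], [-0.3, 0.5]])    # hypothetical Sigma_2
S1h, S2h = sqrtm_spd(S1), sqrtm_spd(S2)

# Linear part of the affine coupling (3.5): Sigma_2^{1/2} Sigma_1^{-1/2}.
B_affine = S2h @ np.linalg.inv(S1h)

# Linear part of the optimal coupling (3.8).
inner = sqrtm_spd(S2h @ S1 @ S2h)
B_opt = S2h @ np.linalg.inv(inner) @ S2h

# Both maps transform the covariance Sigma_1 into Sigma_2.
push_affine = B_affine @ S1 @ B_affine.T
push_opt = B_opt @ S1 @ B_opt.T

# Expected squared distance for equal means (identity from the lead-in).
cost_affine = np.trace(S1) + np.trace(S2) - 2.0 * np.trace(B_affine @ S1)
cost_opt = np.trace(S1) + np.trace(S2) - 2.0 * np.trace(B_opt @ S1)
```

Only `B_opt` is symmetric, which is what allows the corresponding map to be written as the gradient of a convex quadratic potential.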


1.4 Monte Carlo methods 


Monte Carlo methods, also called particle or ensemble methods depending on the 
context in which they are being used, can be used to approximate statistics, name- 
ly, expectation values Ex[ f], for a random variable X. We begin by discussing the 
special case f(x) = x, namely, the mean. 


Definition 3.20 (Empirical mean). Given a sequence $X_i$, $i = 1, \ldots, M$, of independent
random variables with identical measure $\mu_X$, the empirical mean is

$$ \bar{x}_M = \frac{1}{M} \sum_{i=1}^M X_i(\omega) = \frac{1}{M} \sum_{i=1}^M x_i $$

with samples $x_i = X_i(\omega)$.

Of course, $\bar{x}_M$ itself is the realization of a random variable $\bar{X}_M$ and we consider
the mean squared error (MSE)

$$ \mathrm{MSE}(\bar{X}_M) = E_{\bar{X}_M}\big[(\bar{x}_M - \bar{x})^2\big] = \big(E_{\bar{X}_M}[\bar{x}_M] - \bar{x}\big)^2 + E_{\bar{X}_M}\big[(\bar{x}_M - E_{\bar{X}_M}[\bar{x}_M])^2\big] \tag{3.10} $$

with respect to the exact mean value $\bar{x} = E_X[x]$. We have broken down the MSE
into two components: squared bias and variance. Such a decomposition is possible
for any estimator and is known as the bias-variance decomposition. The particular
estimator $\bar{X}_M$ is called unbiased since $E_{\bar{X}_M}[\bar{x}_M] = \bar{x}$ for any $M \ge 1$. Furthermore,
$\bar{X}_M$ converges weakly to $\bar{x}$ under the central limit theorem provided $\mu_X$ has finite
second-order moments, i.e.

$$ \lim_{M \to \infty} E_{\bar{X}_M}\big[(\bar{x}_M - E_{\bar{X}_M}[\bar{x}_M])^2\big] = 0. $$
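
The bias-variance decomposition (3.10) can be observed empirically by repeating the experiment many times. The sketch below assumes Gaussian samples with hypothetical mean and standard deviation; for independent samples with variance $\sigma^2$, the variance of the empirical mean is $\sigma^2 / M$, so the MSE should drop by a factor of 100 between $M = 10$ and $M = 1000$.

```python
import numpy as np

rng = np.random.default_rng(3)

def mse_of_empirical_mean(M, repeats=2000, mean=1.0, std=2.0):
    """Estimate the MSE of the empirical mean of M i.i.d. N(mean, std^2) samples."""
    samples = rng.normal(mean, std, size=(repeats, M))
    x_bar_M = samples.mean(axis=1)            # one empirical mean per repetition
    bias2 = (x_bar_M.mean() - mean) ** 2      # squared bias
    variance = x_bar_M.var()                  # variance of the estimator
    return bias2 + variance, bias2, variance

mse_10, bias2_10, var_10 = mse_of_empirical_mean(10)
mse_1000, bias2_1000, var_1000 = mse_of_empirical_mean(1000)
```

The squared bias is negligible compared with the variance in both cases, consistent with the estimator being unbiased.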


It remains to generate samples $x_i = X_i(\omega)$ from the required distribution. Methods
to do this include the von Neumann rejection method and Markov chain Monte
Carlo methods, which we will briefly discuss in Section 2. Often, the prior distribution
is assumed to be Gaussian, in which case explicit random number generators are
available. We now turn to the situation where samples from the prior distribution are
available, and are to be used to approximate the mean of the posterior distribution
(or any other expectation value).

Importance sampling is a classical method to approximate expectation values of
a random variable $X^t \sim \pi_{X^t}$ using samples from a random variable $X^p \sim \pi_{X^p}$, which
requires that the target PDF $\pi_{X^t}$ is absolutely continuous with respect to the proposal PDF
$\pi_{X^p}$. This is the case for the prior and posterior PDFs from Bayes’ formula (3.4), i.e.
we set the proposal distribution $\pi_{X^p}(x)$ equal to the prior distribution $\pi_X(x)$ and the
posterior distribution $\pi_X(x | y_0) \propto \pi_Y(y_0 | x)\, \pi_X(x)$ becomes the target distribution
$\pi_{X^t}(x)$.


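Definition 3.21 (Importance sampling for Bayesian estimation). Let $x_i^{\mathrm{prior}}$, $i = 1, \ldots, M$,
denote samples from the prior PDF $\pi_X(x)$. Then the importance sampler estimate of
the mean of the posterior $\pi_X(x | y_0)$ is

$$ \bar{x}_M^{\mathrm{post}} = \sum_{i=1}^M w_i\, x_i^{\mathrm{prior}} \tag{3.11} $$

with importance weights

$$ w_i = \frac{\pi_Y(y_0 | x_i^{\mathrm{prior}})}{\sum_{j=1}^M \pi_Y(y_0 | x_j^{\mathrm{prior}})}. \tag{3.12} $$
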

Importance sampling becomes statistically inefficient when the weights vary widely
in magnitude, which becomes particularly significant for high-dimensional
problems. To demonstrate this effect, consider a uniform prior on the unit
hypercube $V = [0,1]^N$. Each of the $M$ samples $x_i$ from this prior formally represents
a hypercube with volume $1/M$. However, the likelihood measures the distance
of a sample $x_i$ to the observation $y_0$ in the Euclidean distance, and the volume of
a hypersphere decreases rapidly relative to that of an associated hypercube as $N$
increases. Within the framework of the bias-variance decomposition of a mean squared
error, for example (3.10), the curse of dimensionality manifests itself in large
variances for finite $M$.
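
The degeneracy of the importance weights with growing dimension can be made visible with a small experiment. The sketch below is an illustration under assumptions not taken from the text: the observation operator is the identity, $y_0$ is the centre of the hypercube, the measurement errors are independent Gaussians with standard deviation `sigma`, and the effective sample size $1 / \sum_i w_i^2$ is used as a common diagnostic for weight degeneracy.

```python
import numpy as np

rng = np.random.default_rng(4)

def effective_sample_size(N, M=10_000, sigma=0.1):
    """Effective sample size 1 / sum(w_i^2) for a uniform prior on [0, 1]^N.

    Assumed setup: h(x) = x, y0 = (0.5, ..., 0.5), and independent
    N(0, sigma^2) measurement errors in each component.
    """
    x = rng.random((M, N))                              # prior samples
    log_w = -0.5 * np.sum((x - 0.5) ** 2, axis=1) / sigma**2
    w = np.exp(log_w - log_w.max())                     # stabilized, unnormalized weights
    w /= w.sum()                                        # weights sum to one
    return 1.0 / np.sum(w**2)

ess = {N: effective_sample_size(N) for N in (1, 5, 20)}
```

With equal weights the effective sample size equals $M$; in this setup it falls from several thousand for $N = 1$ to a handful of samples for $N = 20$, so that the estimator (3.11) is dominated by very few particles.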

To counteract this curse of dimensionality, one may utilize the concept of coupling.
In other words, assume that we have a transport map $x^{\mathrm{post}} = T(x^{\mathrm{prior}})$
which couples the prior and posterior distributions. Then, with transformed samples
$x_i^{\mathrm{post}} = T(x_i^{\mathrm{prior}})$, $i = 1, \ldots, M$, we obtain the estimator

$$ \bar{x}_M^{\mathrm{post}} = \sum_{i=1}^M w_i\, x_i^{\mathrm{post}} $$

with equal weights $w_i = 1/M$.

Sometimes, one cannot couple the prior and posterior distribution directly, or
the coupling is too expensive computationally. Then, one can attempt to find a coupling
between the prior PDF $\pi_X(x)$ and an approximation $\tilde{\pi}_X(x | y_0)$ to the posterior
PDF $\pi_X(x | y_0) \propto \pi_Y(y_0 | x)\, \pi_X(x)$. Given an associated transport map
$X^{\mathrm{prop}} = T(X^{\mathrm{prior}})$, i.e.

$$ \tilde{\pi}_X(T(x) | y_0)\, |DT(x)| = \pi_X(x), $$

one then takes $\tilde{\pi}_X(x | y_0)$ as the proposal density $\pi_{X^p}(x)$ in an importance sampler
with realizations $x_i^{\mathrm{prop}}$, $i = 1, \ldots, M$, defined by

$$ x_i^{\mathrm{prop}} = T\big(x_i^{\mathrm{prior}}\big). $$

An asymptotically unbiased estimator for the posterior mean is now provided by

$$ \bar{x}_M^{\mathrm{post}} = \sum_{i=1}^M w_i\, x_i^{\mathrm{prop}} \tag{3.13} $$

with weights

$$ w_i \propto \frac{\pi_Y(y_0 | x_i^{\mathrm{prop}})\, \pi_X(x_i^{\mathrm{prop}})}{\tilde{\pi}_X(x_i^{\mathrm{prop}} | y_0)} = \frac{\pi_Y(y_0 | x_i^{\mathrm{prop}})\, \pi_X(x_i^{\mathrm{prop}})\, |DT(x_i^{\mathrm{prior}})|}{\pi_X(x_i^{\mathrm{prior}})}, \tag{3.14} $$

$i = 1, \ldots, M$. The constant of proportionality is chosen such that $\sum_{i=1}^M w_i = 1$.
Indeed, if $\pi_{X^p}(x) = \tilde{\pi}_X(x | y_0) = \pi_X(x | y_0)$, we recover the case of equal weights
$w_i = 1/M$, and $\pi_{X^p}(x) = \tilde{\pi}_X(x | y_0) = \pi_X(x)$ leads to standard importance sampling
using prior samples, i.e. $x_i^{\mathrm{prop}} = x_i^{\mathrm{prior}}$.

We will return to the subject of sampling from the posterior distribution in Sec- 
tions 2.3 and 3.2. 


References 


An excellent introduction to many topics covered in this survey is [22]. Bayesian infer- 
ence and a Bayesian perspective on inverse problems are discussed in [24, 31, 39]. The 
monographs [52, 53] provide an in depth introduction to optimal transportation and 
coupling of random variables. Monte Carlo methods are covered in [32]. We also point 
to [20] for a discussion of estimation and regression methods from a bias-variance 
perspective. A discussion of infinite-dimensional Bayesian inference problems can 
be found in [51]. 


2 Stochastic processes 


In this section, we collect basic results concerning stochastic processes. 


Definition 3.22 (Stochastic process). Let $T$ be a set of indices. A stochastic process is
a family $\{X_t\}_{t \in T}$ of random variables on a common space $\mathcal{X}$, i.e. $X_t(\omega) \in \mathcal{X}$.

In the context of dynamical systems, the variable $t$ corresponds to time. We
distinguish between continuous time $t \in [0, t_{\mathrm{end}}] \subset \mathbb{R}$ and discrete time $t_n = n \Delta t$,
$n \in \{0, 1, 2, \ldots\} = T$, with $\Delta t > 0$ a time-increment. In cases where subscript
indices can be confusing, we will also use the notations $X(t)$ and $X(t_n)$, respectively.

A stochastic process can be seen as a function of two arguments: $t$ and $\omega$. For
fixed $\omega$, $X_t(\omega)$ becomes a function of $t \in T$, which we call a realization or trajectory
of the stochastic process. We will restrict ourselves to the case where $X_t(\omega)$ is
continuous in $t$ (with probability 1) in the case of a continuous time. Alternatively, one
can fix the time $t \in T$ and consider the random variable $X_t(\cdot)$ and its distribution.
More generally, one can consider $l$-tuples $(t_1, t_2, \ldots, t_l)$ and associated $l$-tuples of
random variables $(X_{t_1}(\cdot), X_{t_2}(\cdot), \ldots, X_{t_l}(\cdot))$ and their joint distributions. This leads
to concepts such as temporal correlation.


2.1 Discrete time Markov processes 


First, we develop the concept of Markov processes for discrete time processes. 


Definition 3.23 (Discrete time Markov processes). The discrete time stochastic process
$\{X_n\}_{n \in T}$ with $\mathcal{X} = \mathbb{R}^N$ and $T = \{0, 1, 2, \ldots\}$ is called a (time-independent)
Markov process with transition kernel $\pi(x' | x)$ if its joint PDFs can be written as

$$ \pi_n(x_0, x_1, \ldots, x_n) = \pi(x_n | x_{n-1})\, \pi(x_{n-1} | x_{n-2}) \cdots \pi(x_1 | x_0)\, \pi_0(x_0) $$

for all $n \in \{0, 1, 2, \ldots\} = T$. The associated marginal distributions $\pi_n = \pi_{X_n}$ satisfy
the Chapman-Kolmogorov equation

$$ \pi_{n+1}(x') = \int_{\mathbb{R}^N} \pi(x' | x)\, \pi_n(x)\, dx \tag{3.15} $$

and the process can be recursively repeated to yield a family of marginal distributions
$\{\pi_n\}_{n \in T}$ for given $\pi_0$. This family can also be characterized by the linear
Frobenius-Perron operator

$$ \pi_{n+1} = \mathcal{P} \pi_n, \tag{3.16} $$

which is induced by (3.15).

The above definition is equivalent to the more traditional definition that a process
is Markov if the conditional distributions satisfy

$$ \pi_n(x_n | x_0, x_1, \ldots, x_{n-1}) = \pi(x_n | x_{n-1}). $$


Note that, contrary to Bayes’ formula (3.4), which directly yields marginal
distributions, the Chapman-Kolmogorov equation (3.15) starts from a given coupling

$$ \pi_{X_{n+1} X_n}(x_{n+1}, x_n) = \pi(x_{n+1} | x_n)\, \pi_{X_n}(x_n) $$

followed by marginalization to derive $\pi_{X_{n+1}}(x_{n+1})$. A Markov process is called
time-dependent if the conditional PDF $\pi(x' | x)$ depends on $t_n$. While we have considered
time-independent processes in this section, we will see in Section 3 that the idea of
coupling applied to Bayes’ formula leads to time-dependent Markov processes.


2.2 Stochastic difference and differential equations


We start from the stochastic difference equation

$$ X_{n+1} = X_n + \Delta t\, f(X_n) + \sqrt{2 \Delta t}\, Z_n, \qquad t_{n+1} = t_n + \Delta t, \tag{3.17} $$

where $\Delta t > 0$ is a small parameter (the step-size), $f$ is a given (Lipschitz continuous)
function, and $Z_n \sim N(0, Q)$ are independent and identically distributed random
variables with covariance matrix $Q$.

The time evolution of the associated marginal densities $\pi_{X_n}$ is governed by the
Chapman-Kolmogorov equation with conditional PDF

$$ \pi(x' | x) = \frac{1}{(4\pi \Delta t)^{N/2} |Q|^{1/2}} \exp\Big(-\frac{1}{4 \Delta t} \big(x' - x - \Delta t\, f(x)\big)^T Q^{-1} \big(x' - x - \Delta t\, f(x)\big)\Big). \tag{3.18} $$

Proposition 3.24 (Stochastic differential and Fokker-Planck equation). Taking the
limit $\Delta t \to 0$, one obtains the stochastic differential equation (SDE)

$$ dX_t = f(X_t)\, dt + \sqrt{2}\, Q^{1/2}\, dW_t \tag{3.19} $$

for $X_t$, where $\{W_t\}_{t \ge 0}$ denotes standard $N$-dimensional Brownian motion, and the
Fokker-Planck equation

$$ \frac{\partial \pi_X}{\partial t} = -\nabla_x \cdot (\pi_X f) + \nabla_x \cdot (Q \nabla_x \pi_X) \tag{3.20} $$

for the marginal density $\pi_X(x, t)$. Note that $Q = 0$ (no noise) leads to the Liouville,
transport or continuity equation

$$ \frac{\partial \pi_X}{\partial t} = -\nabla_x \cdot (\pi_X f), \tag{3.21} $$

which implies that we may interpret $f$ as a given velocity field in the sense of fluid
mechanics.


Proof. The difference equation (3.17) is called the Euler-Maruyama method for ap- 
proximating the SDE (3.19). See [21, 26] for a discussion on the convergence of (3.17) 
to (3.19) as At — 0. 

The Fokker-Planck equation (3.20) is the linear combination of a drift and a dif- 
fusion term. To simplify the discussion, we derive both terms separately from (3.17) 
by first considering f = 0,Q + Oand then Q = 0, f + 0. To simplify the derivation 
of the diffusion term even further, we also assume x € Rand Q = 1. In other words, 
we show that scalar Brownian motion 


dx; = /2dW; 
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leads to the heat equation 
OTTx = 0°try 


Ot 0x2 
We first note that the conditional PDF (3.18) reduces to

$$\pi(x'|x) = (4\pi\Delta t)^{-1/2} \exp\left(-\frac{(x'-x)^2}{4\Delta t}\right)$$

under $f(x) = 0$, $Q = 1$, $N = 1$, and the Chapman-Kolmogorov equation (3.15) becomes

$$\pi_{n+1}(x') = \int_{\mathbb{R}} \frac{1}{\sqrt{4\pi\Delta t}}\, e^{-y^2/(4\Delta t)}\, \pi_n(x' + y)\,dy \tag{3.22}$$

under the variable substitution $y = x - x'$. We now expand $\pi_n(x' + y)$ in $y$ about $y = 0$, i.e.


$$\pi_n(x' + y) = \pi_n(x') + y\,\frac{\partial \pi_n}{\partial x}(x') + \frac{y^2}{2}\,\frac{\partial^2 \pi_n}{\partial x^2}(x') + \cdots,$$


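and substitute the expansion into (3.22):

$$\begin{aligned}
\pi_{n+1}(x') &= \int_{\mathbb{R}} \frac{1}{\sqrt{4\pi\Delta t}}\, e^{-y^2/(4\Delta t)}\, \pi_n(x')\,dy \\
&\quad + \int_{\mathbb{R}} \frac{1}{\sqrt{4\pi\Delta t}}\, e^{-y^2/(4\Delta t)}\, y\,\frac{\partial \pi_n}{\partial x}(x')\,dy \\
&\quad + \int_{\mathbb{R}} \frac{1}{\sqrt{4\pi\Delta t}}\, e^{-y^2/(4\Delta t)}\, \frac{y^2}{2}\,\frac{\partial^2 \pi_n}{\partial x^2}(x')\,dy + \cdots.
\end{aligned}$$
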

The integrals correspond to the zeroth, first and second-order moments of the Gaussian distribution with mean zero and variance $2\Delta t$. Hence,

$$\pi_{n+1}(x') = \pi_n(x') + \Delta t\,\frac{\partial^2 \pi_n}{\partial x^2}(x') + \cdots,$$


and it can also easily be shown that the neglected higher-order terms contribute with $O(\Delta t^2)$ terms. Therefore,

$$\frac{\pi_{n+1}(x') - \pi_n(x')}{\Delta t} = \frac{\partial^2 \pi_n}{\partial x^2}(x') + O(\Delta t),$$

and the heat equation is obtained upon taking the limit $\Delta t \to 0$. The nonvanishing drift case, i.e. $f(x) \neq 0$, while being more technical, can be treated in the same manner.

One can also use (3.7) to derive Liouville's equation (3.21) directly. We set

$$T(x) = x + \Delta t f(x)$$

and note that

$$|DT(x)| = 1 + \Delta t\,\nabla_x \cdot f + O(\Delta t^2).$$

Hence, (3.7) implies

$$\pi_{X_n} = \pi_{X_{n+1}} + \Delta t\,\pi_{X_{n+1}}\nabla_x \cdot f + \Delta t\,(\nabla_x \pi_{X_{n+1}}) \cdot f + O(\Delta t^2)$$

and

$$\frac{\pi_{X_{n+1}} - \pi_{X_n}}{\Delta t} = -\nabla_x \cdot (\pi_{X_{n+1}} f) + O(\Delta t).$$

Taking the limit $\Delta t \to 0$, we obtain (3.21).


Following the work of Felix Otto (see, e.g. [42, 52]), we note that in the case of pure diffusion, i.e. $f = 0$, the Fokker-Planck equation can be rewritten as a gradient flow system. We first introduce some notation.


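Definition 3.25 (differential geometric structure on manifold of probability densities). We formally introduce the manifold of all PDFs on $\mathcal{X} = \mathbb{R}^N$

$$\mathcal{M} = \left\{\pi : \mathbb{R}^N \to \mathbb{R} \,:\, \pi(x) \ge 0,\ \int_{\mathbb{R}^N} \pi(x)\,dx = 1\right\}$$

with tangent space

$$T_\pi\mathcal{M} = \left\{\phi : \mathbb{R}^N \to \mathbb{R} \,:\, \int_{\mathbb{R}^N} \phi(x)\,dx = 0\right\}.$$

The variational derivative of a functional $F : \mathcal{M} \to \mathbb{R}$ is defined as

$$\int_{\mathbb{R}^N} \frac{\delta F}{\delta \pi}\,\phi\,dx = \lim_{\epsilon \to 0} \frac{F(\pi + \epsilon\phi) - F(\pi)}{\epsilon},$$

where $\phi$ is a function such that $\int_{\mathbb{R}^N} \phi\,dx = 0$, i.e. $\phi \in T_\pi\mathcal{M}$.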


Consider the potential

$$V(\pi_X) = \int_{\mathbb{R}^N} \pi_X \ln \pi_X\,dx, \tag{3.23}$$

which has the functional derivative

$$\frac{\delta V}{\delta \pi_X} = \ln \pi_X,$$

since

$$\begin{aligned}
V(\pi_X + \epsilon\phi) &= V(\pi_X) + \epsilon \int_{\mathbb{R}^N} (\phi \ln \pi_X + \phi)\,dx + O(\epsilon^2) \\
&= V(\pi_X) + \epsilon \int_{\mathbb{R}^N} \phi \ln \pi_X\,dx + O(\epsilon^2).
\end{aligned}$$

Hence, we find that the diffusion part of the Fokker-Planck equation is equivalent to

$$\frac{\partial \pi_X}{\partial t} = \nabla_x \cdot (Q\nabla_x \pi_X) = \nabla_x \cdot \left\{\pi_X Q \nabla_x \frac{\delta V}{\delta \pi_X}\right\}. \tag{3.24}$$

This formulation allows us to treat diffusion in the form of a vector field

$$v(x,t) = -Q\nabla_x \frac{\delta V}{\delta \pi_X},$$

which, contrary to vector fields arising from the theory of ordinary differential equations, depends on the PDF $\pi_X$. See the following Section 2.3 for an application.


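Proposition 3.26 (Gradient on the manifold of probability densities). Let $g_\pi$ be a metric tensor defined on $T_\pi\mathcal{M}$ as

$$g_\pi(\phi_1, \phi_2) = \int_{\mathbb{R}^N} (\nabla_x \psi_1) \cdot (M \nabla_x \psi_2)\,\pi\,dx$$

with potentials $\psi_i$, $i = 1, 2$, determined by the elliptic partial differential equation (PDE)

$$-\nabla_x \cdot (\pi M \nabla_x \psi_i) = \phi_i,$$

where $M \in \mathbb{R}^{N \times N}$ is a symmetric, positive-definite matrix. Then, the gradient of a potential $F(\pi)$ under $g_\pi$ satisfies

$$\operatorname{grad}_\pi F(\pi) = -\nabla_x \cdot \left(\pi M \nabla_x \frac{\delta F}{\delta \pi}\right). \tag{3.25}$$
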
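Proof. Given the metric tensor $g_\pi$, the gradient is defined by

$$g_\pi(\operatorname{grad}_\pi F(\pi), \phi) = \int_{\mathbb{R}^N} \frac{\delta F}{\delta \pi}\,\phi\,dx \tag{3.26}$$

for all $\phi \in T_\pi\mathcal{M}$. Since any element $\phi \in T_\pi\mathcal{M}$ can be written in the form

$$\phi = -\nabla_x \cdot (\pi M \nabla_x \psi)$$

with suitable potential $\psi$, a potential $\hat\psi$ exists such that

$$\operatorname{grad}_\pi F(\pi) = -\nabla_x \cdot (\pi M \nabla_x \hat\psi) \in T_\pi\mathcal{M}$$

and we need to demonstrate that

$$\hat\psi = \frac{\delta F}{\delta \pi}$$

is consistent with (3.26). Indeed, we find that

$$\begin{aligned}
\int_{\mathbb{R}^N} \frac{\delta F}{\delta \pi}\,\phi\,dx &= -\int_{\mathbb{R}^N} \frac{\delta F}{\delta \pi}\,\nabla_x \cdot (\pi M \nabla_x \psi)\,dx \\
&= \int_{\mathbb{R}^N} \pi\,\left(\nabla_x \frac{\delta F}{\delta \pi}\right) \cdot (M \nabla_x \psi)\,dx \\
&= \int_{\mathbb{R}^N} (\nabla_x \hat\psi) \cdot (M \nabla_x \psi)\,\pi\,dx \\
&= g_\pi(\operatorname{grad}_\pi F(\pi), \phi).
\end{aligned}$$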


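It follows that the diffusion part of the Fokker-Planck equation can be viewed as a gradient flow on the manifold $\mathcal{M}$. More precisely, set $F(\pi) = V(\pi_X)$ and $M = Q$ to reformulate (3.24) as a gradient flow

$$\frac{\partial \pi_X}{\partial t} = -\operatorname{grad}_{\pi_X} V(\pi_X)$$

with potential (3.23). We will find in Section 3 that related geometric structures arise from Bayes' formula in the context of filtering. We finally note that

$$\frac{dV}{dt} = \int_{\mathbb{R}^N} \frac{\delta V}{\delta \pi_X}\,\frac{\partial \pi_X}{\partial t}\,dx = -\int_{\mathbb{R}^N} \left(\nabla_x \frac{\delta V}{\delta \pi_X}\right) \cdot \left(M \nabla_x \frac{\delta V}{\delta \pi_X}\right) \pi_X\,dx \le 0.$$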


2.3 Ensemble prediction and sampling methods 


In this section, we extend the Monte Carlo method from Section 1.4 to the approximation of the marginal PDFs $\pi_X(x,t)$, $t \ge 0$, evolving under the SDE model (3.19). Assume that we have a set of independent samples $x_i(0)$, $i = 1,\ldots,M$, from the initial PDF $\pi_X(x,0)$.

Definition 3.27 (ensemble prediction). A Monte Carlo approximation to the time-evolved marginal PDFs $\pi_X(x,t)$ can be obtained from solving the SDEs

$$dx_i = f(x_i)\,dt + \sqrt{2}\,Q^{1/2}\,dW_i(t) \tag{3.27}$$

for $i = 1,\ldots,M$, where $\{W_i(t)\}_{i=1}^M$ denote realizations of independent standard $N$-dimensional Brownian motion and the initial conditions $\{x_i(0)\}_{i=1}^M$ are realizations of the initial PDF $\pi_X(x,0)$. This approximation provides an example of a particle or ensemble prediction method and it can be shown that the estimator

$$\bar{x}_M(t) = \frac{1}{M}\sum_{i=1}^M x_i(t) \tag{3.28}$$

provides a consistent and unbiased approximation to the mean $E_{X_t}[x]$.
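As a rough illustration of Definition 3.27, the following sketch propagates a scalar ensemble with the Euler-Maruyama method (3.17) and evaluates the estimator (3.28). The drift $f(x) = -x$, the noise level, the step-size and the ensemble size are assumptions chosen for the example; for this drift the exact mean decays like $e^{-t}$, which gives a simple sanity check.

```python
import numpy as np

def ensemble_predict(x0, f, dt, n_steps, q, rng):
    """Propagate a scalar ensemble (shape (M,)) with Euler-Maruyama steps."""
    x = np.array(x0, dtype=float)
    for _ in range(n_steps):
        noise = rng.standard_normal(x.shape)
        x = x + dt * f(x) + np.sqrt(2.0 * dt * q) * noise
    return x

# Illustrative setup (assumption): f(x) = -x, Q = 1, initial ensemble around x = 1.
rng = np.random.default_rng(1)
M = 5000
x0 = 1.0 + 0.1 * rng.standard_normal(M)
xT = ensemble_predict(x0, lambda x: -x, dt=0.01, n_steps=100, q=1.0, rng=rng)
mean_estimate = xT.mean()  # ensemble mean, estimator (3.28), at t = 1
```

The estimate carries both a sampling error of order $M^{-1/2}$ and a time-stepping error from the discretization.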




Alternatively, using formulation (3.24) of the Fokker-Planck equation (3.20) in the pure diffusion case, we may reformulate the random part in (3.27) and introduce particle equations

$$\frac{dx_i}{dt} = f(x_i) - Q\nabla_x \frac{\delta V}{\delta \pi_X}(x_i) = f(x_i) - \frac{1}{\pi_X(x_i,t)}\,Q\nabla_x \pi_X(x_i,t), \tag{3.29}$$

$i = 1,\ldots,M$. Contrary to the SDE (3.27), this formulation requires the PDF $\pi_X(x,t)$, which is not explicitly available in general. However, a Gaussian approximation can be obtained from the available ensemble $x_i(t)$, $i = 1,\ldots,M$, using

$$\pi_X(x,t) \approx \frac{1}{(2\pi)^{N/2}|P|^{1/2}} \exp\left(-\frac{1}{2}(x - \bar{x}_M(t))^T P^{-1} (x - \bar{x}_M(t))\right)$$

with empirical mean (3.28) and empirical covariance matrix

$$P = \frac{1}{M-1}\sum_{i=1}^M (x_i - \bar{x}_M)(x_i - \bar{x}_M)^T. \tag{3.30}$$

Substituting this Gaussian approximation into (3.29) yields the ensemble evolution equations

$$\frac{dx_i}{dt} = f(x_i) + QP^{-1}(x_i - \bar{x}_M), \tag{3.31}$$

which become exact if the vector field $f$ is linear, i.e. $f(x) = Ax + u$, the initial PDF $\pi_X(x,0)$ is Gaussian, and in the limit of ensemble sizes $M \to \infty$.
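A minimal sketch of the deterministic ensemble evolution (3.31) for a scalar state follows. The forward Euler discretization, the function name and the parameter values are assumptions for illustration. For $f = 0$ and $Q = 1$ the empirical variance should grow by about $2t$, as it does under the heat equation.

```python
import numpy as np

def ensemble_diffusion_step(x, f, dt, q):
    """Forward Euler step of (3.31) for a scalar ensemble x of shape (M,)."""
    xbar = x.mean()
    p = x.var(ddof=1)  # empirical variance, scalar version of (3.30)
    return x + dt * (f(x) + q * (x - xbar) / p)

# Illustrative setup (assumption): pure diffusion, f = 0 and Q = 1.
rng = np.random.default_rng(2)
x = rng.standard_normal(200)
p0 = x.var(ddof=1)
for _ in range(1000):
    x = ensemble_diffusion_step(x, lambda v: 0.0 * v, 1e-3, 1.0)
p1 = x.var(ddof=1)  # expected to be close to p0 + 2 after unit time
```

No random numbers are drawn during the evolution; the spread is generated entirely by the term $QP^{-1}(x_i - \bar{x}_M)$.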

We finally discuss the application of a particular type of SDEs (3.19) as a way of generating samples $x_i$ from a given PDF such as the posterior $\pi_X(x|y_0)$ of Bayesian inference. To do this, consider the SDE (3.19) with the vector field $f$ being generated by a potential $U(x)$, i.e. $f(x) = -\nabla_x U(x)$, and $Q = I$. Then, it can easily be verified that the PDF

$$\pi_X^*(x) = Z^{-1}\exp(-U(x)), \qquad Z = \int_{\mathbb{R}^N} \exp(-U(x))\,dx,$$

is stationary under the associated Fokker-Planck equation (3.20). Indeed,

$$\nabla_x \cdot (\pi_X^* \nabla_x U) + \nabla_x \cdot \nabla_x \pi_X^* = \nabla_x \cdot (\pi_X^* \nabla_x U + \nabla_x \pi_X^*) = 0.$$

Furthermore, it can be shown that $\pi_X^*$ is the unique stationary PDF and that any initial PDF $\pi_X(t = 0)$ approaches $\pi_X^*$ at an exponential rate under an appropriate assumption on the potential $U$. Hence, $X_t \sim \pi_X^*$ for $t \to \infty$. This allows us to use an ensemble of solutions $x_i(t)$ of (3.27) with an arbitrary initial PDF $\pi_X(x,0)$ as a method for generating ensembles from the prior or posterior Bayesian PDFs provided $U(x) = -\ln \pi_X(x)$ or $U(x) = -\ln \pi_X(x|y_0)$, respectively. Note that the temporal dynamics of the associated SDE (3.19) is not of any physical significance in this context; the SDE formulation is only taken as a device for generating the desired samples. If the SDE formulation is replaced by the Euler-Maruyama method (3.17), time-stepping errors lead to sampling errors which can be corrected for by combining (3.17) with a Metropolis accept-reject criterion. The Metropolis adjusted method gives rise to particular instances of Markov chain Monte Carlo (MCMC) methods, for example, the Metropolis adjusted Langevin algorithm (MALA) or the hybrid Monte Carlo (HMC) method. The basic idea of MALA (as well as HMC) is to rewrite (3.17) with $f(x) = -\nabla_x U(x)$, $Q = I$ as


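$$\begin{aligned}
p_{n+1/2} &= p_n - \tfrac{1}{2}\sqrt{2\Delta t}\,\nabla_x U(x_n), && (3.32) \\
x_{n+1} &= x_n + \sqrt{2\Delta t}\,p_{n+1/2}, && (3.33) \\
p_{n+1} &= p_{n+1/2} - \tfrac{1}{2}\sqrt{2\Delta t}\,\nabla_x U(x_{n+1}), && (3.34)
\end{aligned}$$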


having introduced a dummy momentum variable $p$ with $p_n$ being a realization of the random variable $Z_n \sim N(0, I)$. Under the Metropolis accept-reject criterion, $x_{n+1}$ is accepted with probability

$$\min\{1, \exp(-(E_{n+1} - E_n))\},$$

where

$$E_n = \tfrac{1}{2}p_n^T p_n + U(x_n), \qquad E_{n+1} = \tfrac{1}{2}p_{n+1}^T p_{n+1} + U(x_{n+1})$$

are the initial and final energies. Upon rejection, one continues with $x_n$. The momentum value $p_{n+1}$ is discarded after a completed time-step (regardless of its acceptance or rejection) and a new momentum value is drawn from $N(0, I)$. It should however be noted that $|E_{n+1} - E_n| \to 0$ as the step-size $\Delta t$ goes to zero and, in practice, the application of the Metropolis accept-reject step is often not necessary unless $\Delta t$ is chosen too large. The HMC method differs from MALA in that several iterations of (3.32)-(3.34) are applied before the Metropolis accept-reject criterion is applied.
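The steps (3.32)-(3.34) together with the accept-reject rule can be sketched as follows. This is an illustrative implementation under stated assumptions, not the authors' code: the function name `mala_step`, the standard Gaussian target $U(x) = \tfrac{1}{2}x^T x$ and the step-size $\Delta t = 0.5$ are chosen for the example only.

```python
import numpy as np

def mala_step(x, U, grad_U, dt, rng):
    """One Metropolis-adjusted step built from (3.32)-(3.34)."""
    h = np.sqrt(2.0 * dt)
    p = rng.standard_normal(x.shape)          # fresh momentum p_n ~ N(0, I)
    p_half = p - 0.5 * h * grad_U(x)          # (3.32)
    x_new = x + h * p_half                    # (3.33)
    p_new = p_half - 0.5 * h * grad_U(x_new)  # (3.34)
    e_old = 0.5 * p @ p + U(x)
    e_new = 0.5 * p_new @ p_new + U(x_new)
    # Accept with probability min{1, exp(-(E_{n+1} - E_n))}.
    if np.log(rng.uniform()) < -(e_new - e_old):
        return x_new, True
    return x, False

# Illustrative target (assumption): standard Gaussian, U(x) = x^T x / 2.
rng = np.random.default_rng(3)
U = lambda x: 0.5 * x @ x
grad_U = lambda x: x
x = np.zeros(1)
samples = []
for _ in range(20000):
    x, accepted = mala_step(x, U, grad_U, 0.5, rng)
    samples.append(x[0])
samples = np.array(samples)
```

Applying several leapfrog iterations of (3.32)-(3.34) before the accept-reject test would turn this sketch into an HMC step.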
A gentle introduction to stochastic processes can be found in [17] and [10]. A more

3 Data assimilation and filtering


In this section, we combine Bayesian inference and stochastic processes to tackle the 
problem of assimilating observational data into scientific models. 


3.1 Preliminaries 


We select a model written as a time-discretized SDE, such as (3.17), with the initial random variable satisfying $X_0 \sim \pi_0$. In addition to the pure prediction problem of computing $\pi_n$, $n \ge 1$, for given $\pi_0$, we assume that model states $x \in \mathcal{X} = \mathbb{R}^N$ are partially observed at equally spaced instances in time. These observations are to be assimilated into the model. More generally, intermittent data assimilation is concerned with fixed observation intervals $\Delta t_{obs} > 0$ and model time-steps $\Delta t$ such that $\Delta t_{obs} = L\Delta t$, $L \ge 1$, which allows one to take the limit $L \to \infty$, $\Delta t = \Delta t_{obs}/L \to 0$. For simplicity, we will restrict the discussion to the case where observations $y_0(t_n) = Y_n(\omega) \in \mathbb{R}^K$ are made at every time step $t_n = n\Delta t$, $n \ge 1$, and the limit $\Delta t \to 0$ is not considered here. We will further assume that the observed random variables $Y_n$ satisfy the model (3.2), i.e.

$$Y_n = h(X_n) + \Xi_n,$$

and the measurement errors $\Xi_n \sim N(0, R)$ are mutually independent with common error covariance matrix $R$. We introduce the notation $Y_k = \{y_0(t_i)\}_{i=1,\ldots,k}$ to denote all observations up to and including time $t_k$.


Definition 3.28 (Data assimilation). Data assimilation is the estimation of marginal PDFs $\pi_n(x|Y_k)$ of the random variable $X_n = X(t_n)$ conditioned on the set of observations $Y_k$. We distinguish three cases: (i) filtering $k = n$, (ii) smoothing $k > n$, and (iii) prediction $k < n$.


The subsequent discussions are restricted to the filtering problem. We have already seen that evolution of the marginal distributions under (3.17) alone is governed by the Chapman-Kolmogorov equation (3.15) with transition probability density (3.18). We denote the associated Frobenius-Perron operator (3.16) by $P_{\Delta t}$. Given $X_0 \sim \pi_0$, we first obtain

$$\pi_1 = P_{\Delta t}\pi_0.$$

This time propagated PDF is used as the prior PDF $\pi_X = \pi_1$ in Bayes' formula (3.4) at $t = t_1$ with $y_0 = y_0(t_1)$ and likelihood

$$\pi_Y(y|x) = \frac{1}{(2\pi)^{K/2}|R|^{1/2}} \exp\left(-\frac{1}{2}(y - h(x))^T R^{-1} (y - h(x))\right).$$

Bayes' formula implies the posterior PDF

$$\pi_1(x|Y_1) \propto \pi_Y(y_0(t_1)|x)\,\pi_1(x),$$

where the constant of proportionality only depends on $y_0(t_1)$.

Proposition 3.29 (Sequential filtering). The filtering problem leads to the recursion

$$\begin{aligned}
\pi_{n+1}(\cdot|Y_n) &= P_{\Delta t}\,\pi_n(\cdot|Y_n), \\
\pi_{n+1}(x|Y_{n+1}) &\propto \pi_Y(y_0(t_{n+1})|x)\,\pi_{n+1}(x|Y_n),
\end{aligned} \tag{3.35}$$

$n \ge 0$, and $X_n \sim \pi_n(\cdot|Y_n)$ solves the filtering problem at time $t_n$. The constant of proportionality only depends on $y_0(t_{n+1})$.


Proof. The recursion follows by induction.


Recall that the Frobenius-Perron operator $P_{\Delta t}$ is generated by the stochastic difference equation (3.17). On the other hand, Bayes' formula only leads to a transition from the predicted $\pi_{n+1}(x|Y_n)$ to the filtered $\pi_{n+1}(x|Y_{n+1})$. Following our discussion on transport maps from Section 1.3, we assume the existence of a transport map $X' = T_{n+1}(X)$, depending on $y_0(t_{n+1})$, that couples the two PDFs. The use of optimal transport maps in the context of Bayesian inference and intermittent data assimilation was first proposed in [37, 43].


Proposition 3.30 (Filtering by transport maps). Assuming the existence of appropriate transport maps $T_{n+1}$, which couple $\pi_{n+1}(x|Y_n)$ and $\pi_{n+1}(x|Y_{n+1})$, the filtering problem is solved by the following recursion for the random variables $X_{n+1}$, $n \ge 0$:

$$X_{n+1} = T_{n+1}\left(X_n + \Delta t f(X_n) + \sqrt{2\Delta t}\,Z_n\right), \tag{3.36}$$

which gives rise to a time-dependent Markov process.


Proof. Follows trivially from (3.35).


The rest of this section is devoted to several Monte Carlo methods for sequential 
filtering. 


3.2 Sequential Monte Carlo method 


In our framework, a standard sequential Monte Carlo method, also called bootstrap particle filter, may be described as an ensemble of random variables $X_i$ and associated realizations (referred to as “particles”) $x_i = X_i(\omega)$, which follow the stochastic difference equation (3.17), choosing the transport map in (3.36) to be the identity map. Observational data is taken into account using importance sampling as discussed in Section 1.4, i.e. each particle carries a weight $w_i(t_n)$, which is updated according to Bayes' formula

$$w_i(t_{n+1}) \propto w_i(t_n)\,\pi_Y(y_0(t_{n+1})|x_i(t_{n+1})).$$

The constant of proportionality is chosen such that the new weights $\{w_i(t_{n+1})\}_{i=1}^M$ sum to one.

Whenever the particle weights $w_i(t_n)$ start to become highly nonuniform (or possibly also after each assimilation step), resampling is necessary in order to generate a new family of random variables with equal weights.
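The weight update of the bootstrap particle filter can be sketched as follows for a Gaussian likelihood. Working with logarithms of the weights is an implementation choice made here to avoid numerical underflow; it is not prescribed by the text. The function name and the three-particle example are hypothetical.

```python
import numpy as np

def update_weights(w, x, y_obs, h, R_inv):
    """Reweight particles x (shape (M, N)) with the likelihood N(h(x), R)."""
    innov = y_obs - np.array([h(xi) for xi in x])  # shape (M, K)
    log_lik = -0.5 * np.einsum("ik,kl,il->i", innov, R_inv, innov)
    log_w = np.log(w) + log_lik
    log_w -= log_w.max()                           # guard against underflow
    w_new = np.exp(log_w)
    return w_new / w_new.sum()                     # weights sum to one

# Illustrative example (assumption): three scalar particles, h(x) = x, R = 1.
x = np.array([[0.0], [1.0], [2.0]])
w = np.full(3, 1.0 / 3.0)
w = update_weights(w, x, np.array([1.0]), lambda v: v, np.array([[1.0]]))
```

The normalization constant of the likelihood cancels when the weights are rescaled, so it is omitted.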

Most available resampling methods start from the weighted empirical measure

$$\mu_X(dx) = \sum_{i=1}^M w_i\,\mu_{x_i}(dx) \tag{3.37}$$

associated with a set of weighted samples $\{x_i, w_i\}_{i=1}^M$. The idea is to replace each of the original samples $x_i$ by $\xi_i \ge 0$ offspring with equal weights $w_i = 1/M$ such that $E[\xi_i] = w_i M$. The distribution of offspring is chosen to be equal to the distribution of $M$ samples (with replacement) drawn at random from the empirical distribution (3.37). In other words, the offspring $\{\xi_i\}_{i=1}^M$ follow a multinomial distribution defined by

$$P(\xi_i = n_i,\ i = 1,\ldots,M) = \frac{M!}{\prod_{i=1}^M n_i!}\,\prod_{i=1}^M (w_i)^{n_i} \tag{3.38}$$

with $n_i \ge 0$ such that $\sum_{i=1}^M n_i = M$. In practice, independent resampling is often replaced by residual or systematic resampling. We next summarize residual resampling while we refer the reader to [3] for an algorithmic description of systematic resampling.


Definition 3.31 (Residual resampling). Residual resampling generates

$$\xi_i = \lfloor M w_i \rfloor + \bar\xi_i$$

offspring of each ensemble member $x_i$ with weight $w_i$, $i = 1,\ldots,M$. Here, $\lfloor x \rfloor$ denotes the integer part of $x$ and $\bar\xi_i$ follows the multinomial distribution (3.38) with weights $w_i$ being replaced by

$$\bar{w}_i = \frac{M w_i - \lfloor M w_i \rfloor}{\sum_{j=1}^M \left(M w_j - \lfloor M w_j \rfloor\right)}$$

and with a total of

$$\bar{M} := M - \sum_{i=1}^M \lfloor M w_i \rfloor$$

independent trials.


In generalization of (3.38), we introduce the notation $\mathrm{Mult}(L; w_1,\ldots,w_M)$ to denote the multinomial distribution of $L$ independent trials, where the outcome of each trial is distributed among $M$ possible outcomes according to probabilities $\{w_i\}_{i=1}^M$. The following algorithm draws random samples from $\mathrm{Mult}(L; w_1,\ldots,w_M)$. We first introduce the generalized right inverse

$$F_{emp}^{-1}(u) = i \iff u \in \left(\sum_{j=1}^{i-1} w_j,\ \sum_{j=1}^{i} w_j\right]$$

of the cumulative distribution function for the empirical measure (3.37), which maps $[0,1]$ to $\{1,\ldots,M\}$. We next draw $L$ independent samples $u_l \in [0,1]$ from the uniform distribution $U([0,1])$ and initially set the number of copies $\xi_i$, $i = 1,\ldots,M$, equal to zero. For $l = 1,\ldots,L$, we now increment $\xi_{I_l}$ by one for indices $I_l \in \{1,\ldots,M\}$ defined by

$$I_l = F_{emp}^{-1}(u_l) = \arg\min_{i \ge 1}\left\{\sum_{j=1}^{i} w_j \ge u_l\right\}.$$
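The sampling algorithm above and residual resampling (Definition 3.31) can be sketched as follows. The function names, the use of `numpy.searchsorted` to evaluate the generalized inverse, and the small tolerance that protects the integer part against round-off are implementation assumptions, not part of the text.

```python
import numpy as np

def multinomial_counts(w, L, rng):
    """Draw offspring counts from Mult(L; w_1, ..., w_M) via the inverse CDF."""
    cdf = np.cumsum(w)
    cdf[-1] = 1.0                               # guard against round-off
    u = rng.uniform(size=L)
    idx = np.searchsorted(cdf, u, side="left")  # smallest i with cdf[i] >= u
    return np.bincount(idx, minlength=len(w))

def residual_counts(w, rng):
    """Offspring counts xi_i of residual resampling (Definition 3.31)."""
    M = len(w)
    base = np.floor(M * w + 1e-12).astype(int)  # integer parts of M * w_i
    n_left = M - base.sum()                     # number of remaining trials
    if n_left <= 0:
        return base
    resid = np.clip(M * w - base, 0.0, None)
    return base + multinomial_counts(resid / resid.sum(), n_left, rng)

# Illustrative example (assumption): all M * w_i are integers here,
# so residual resampling is deterministic.
rng = np.random.default_rng(4)
counts = residual_counts(np.array([0.5, 0.25, 0.25, 0.0]), rng)
```

For equal weights the residual scheme returns exactly one copy of each sample, while `multinomial_counts` would still produce random counts.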

Both independent and residual resampling can be viewed as providing a cou- 
pling between the empirical measure (3.37) with all weights being equal to w; = 1/M 
and the target measure (3.37) with identical samples {x;}, but nonuniform weights. 
Clearly, residual resampling provides a coupling with a smaller transport cost. This 
can already be concluded from the trivial case of equal weights in the target measure 
in which case residual resampling reduces to the identity map with zero transport 
cost, while independent resampling remains nondeterministic and produces a nonze- 
ro transport cost. The following example outlines the optimal transportation perspec- 
tive on resampling more precisely for two discrete, univariate random variables. 


Example 3.32 (Coupling discrete random variables). Let us consider two discrete, univariate random variables $X_i : \Omega \to \mathcal{X}$, $i = 1, 2$, with target set

$$\mathcal{X} = \{x_1, x_2, \ldots, x_M\} \in \mathbb{R}^M.$$

We furthermore assume that

$$P(X_1(\omega) = x_i) = 1/M, \qquad P(X_2(\omega) = x_i) = w_i$$

for given probabilities/weights $w_i \ge 0$, $i = 1,\ldots,M$. Any coupling of $X_1$ and $X_2$ is characterized by a matrix $T \in \mathbb{R}^{M \times M}$ such that $t_{ij} = (T)_{ij} \ge 0$ and

$$\sum_{i=1}^M t_{ij} = 1/M, \qquad \sum_{j=1}^M t_{ij} = w_i.$$

Given a coupling $T$ and the mean values

$$\bar{x}_1 = \frac{1}{M}\sum_{i=1}^M x_i, \qquad \bar{x}_2 = \sum_{i=1}^M w_i x_i,$$

the covariance between $X_1$ and $X_2$ is defined by

$$\operatorname{cov}(X_1, X_2) = \sum_{i,j} (x_i - \bar{x}_2)\,t_{ij}\,(x_j - \bar{x}_1).$$

The induced Markov transition matrix from $X_1$ to $X_2$ is simply given by $MT$. Independent resampling corresponds to $t_{ij} = w_i/M$ and leads to a zero correlation between $X_1$ and $X_2$. On the other hand, maximizing the correlation results in a linear programming problem for the $M^2$ unknowns $\{t_{ij}\}$. Its solution then also defines the solution to the associated optimal transportation problem (3.6). Implementations of this approach for sequential data assimilation are discussed in [46].
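For univariate samples, the covariance-maximizing coupling in Example 3.32 does not require a general linear programming solver: it is the monotone (comonotone) coupling, which can be filled in greedily after sorting the samples. The sketch below relies on this standard fact about one-dimensional optimal transport; the function names and the test data are assumptions for illustration, and the multivariate case would still need a linear programming solver.

```python
import numpy as np

def comonotone_coupling(x, w):
    """Coupling T with column sums 1/M (for X_1) and row sums w (for X_2)."""
    M = len(x)
    order = np.argsort(x)
    row_left = w[order].astype(float)  # mass still to place in each row
    col_left = np.full(M, 1.0 / M)     # mass still to place in each column
    T_sorted = np.zeros((M, M))
    i = j = 0
    while i < M and j < M:
        m = min(row_left[i], col_left[j])
        T_sorted[i, j] += m
        row_left[i] -= m
        col_left[j] -= m
        if row_left[i] <= 1e-15:
            i += 1
        if col_left[j] <= 1e-15:
            j += 1
    T = np.zeros((M, M))
    T[np.ix_(order, order)] = T_sorted  # undo the sorting
    return T

def coupling_covariance(T, x, w):
    """cov(X_1, X_2) as defined in Example 3.32."""
    return (x - w @ x) @ T @ (x - x.mean())

x = np.array([0.3, -1.0, 2.0, 0.5])
w = np.array([0.1, 0.2, 0.3, 0.4])
T_opt = comonotone_coupling(x, w)
T_ind = np.outer(w, np.full(4, 0.25))  # independent resampling, t_ij = w_i / M
```

Comparing `coupling_covariance(T_opt, x, w)` with the value for `T_ind` shows the gain in correlation over independent resampling, for which the covariance vanishes.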


More generally, sequential Monte Carlo methods differ by the way resampling is implemented and also in the choice of proposal step which, in our context, amounts to choosing transport maps $T_{n+1}$ in (3.36) which are different from the identity map. See also the discussion in Section 3.5 below.


3.3 Ensemble Kalman filter (EnKF) 


We now introduce an alternative to sequential Monte Carlo methods which has become hugely popular in the geophysical community in recent years. The idea is to construct a simple but robust transport map, which replaces $T_{n+1}$ in (3.36). This transport map is based on the Kalman update equations for linear SDEs and Gaussian prior and posterior distributions. We recall the standard Kalman filter update equations.


Proposition 3.33 (Kalman update for Gaussian distributions). Let the prior distribution $\pi_X$ be Gaussian with mean $\bar{x}^f$ and covariance matrix $P^f$. Observations $y_0$ are assumed to follow the linear model

$$Y = HX + \Xi,$$

where $\Xi \sim N(0, R)$ and $R$ is a symmetric, positive-definite matrix. Then, the posterior distribution $\pi_X(x|y_0)$ is also Gaussian with mean

$$\bar{x}^a = \bar{x}^f - P^f H^T (HP^f H^T + R)^{-1} (H\bar{x}^f - y_0) \tag{3.39}$$

and covariance matrix

$$P^a = P^f - P^f H^T (HP^f H^T + R)^{-1} HP^f. \tag{3.40}$$


Here, we adopt the standard meteorological notation with superscript f (forecast) de- 
noting prior statistics, and superscript a (analysis) denoting posterior statistics. 


Proof. By straightforward generalization to vector-valued observations of the case of 
a scalar observation already discussed in Section 1.2. 
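The update formulas (3.39) and (3.40) translate directly into code. The sketch below is illustrative; the function name and the scalar test case are assumptions. A linear solve is used instead of an explicit matrix inverse, which is a common numerical choice rather than something required by the text.

```python
import numpy as np

def kalman_update(x_f, P_f, y_obs, H, R):
    """Posterior mean (3.39) and covariance (3.40) for a linear observation."""
    S = H @ P_f @ H.T + R
    # Gain K = P_f H^T S^{-1}; S and P_f are symmetric.
    K = np.linalg.solve(S, H @ P_f).T
    x_a = x_f - K @ (H @ x_f - y_obs)
    P_a = P_f - K @ H @ P_f
    return x_a, P_a

# Illustrative scalar case (assumption): prior N(0, 1), R = 1, observation y_0 = 2.
x_a, P_a = kalman_update(np.array([0.0]), np.array([[1.0]]),
                         np.array([2.0]), np.array([[1.0]]), np.array([[1.0]]))
```

In this scalar case the posterior mean lies halfway between prior mean and observation, and the variance is halved.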


EnKFs rely on the assumption that the predicted PDF $\pi_{n+1}(x|Y_n)$ is approximately Gaussian. The ensemble $\{x_i\}_{i=1}^M$ of model states is used to estimate the mean and the covariance matrix using the empirical estimates (3.28) and (3.30), respectively. The key novel idea of EnKFs is to then interpret the posterior mean and covariance matrix in terms of appropriately adjusted ensemble positions. This adjustment can be thought of as a coupling of the underlying prior and posterior random variables of which the ensembles are realizations. The original EnKF [9] uses perturbed observations to achieve the desired coupling.


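Definition 3.34 (Ensemble Kalman Filter). The EnKF with perturbed observations for a linear observation operator $h(x) = Hx$ is given by

$$\begin{aligned}
X^f_{n+1} &= X_n + \Delta t f(X_n) + \sqrt{2\Delta t}\,Z_n, && (3.41) \\
X_{n+1} &= X^f_{n+1} - P^f_{n+1} H^T \left(HP^f_{n+1} H^T + R\right)^{-1} \left(HX^f_{n+1} - y_0 + \Xi_{n+1}\right), && (3.42)
\end{aligned}$$

where the random variables $Z_n \sim N(0, Q)$, $\Xi_{n+1} \sim N(0, R)$ are mutually independent, $y_0 = y_0(t_{n+1})$, $\bar{x}^f_{n+1} = E_{X^f_{n+1}}[x]$, and

$$P^f_{n+1} = E_{X^f_{n+1}}\left[(x - \bar{x}^f_{n+1})(x - \bar{x}^f_{n+1})^T\right].$$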


Next, we investigate the properties of the assimilation step (3.42). 


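Proposition 3.35 (EnKF consistency). The EnKF update step (3.42) propagates the mean and covariance matrix of $X$ in accordance with the Kalman filter equations for Gaussian PDFs.


Proof. It is easy to verify that the ensemble mean satisfies

$$\bar{x}_{n+1} = \bar{x}^f_{n+1} - P^f_{n+1} H^T \left(HP^f_{n+1} H^T + R\right)^{-1} \left(H\bar{x}^f_{n+1} - y_0\right),$$

which is consistent with the Kalman filter update for the ensemble mean. Furthermore, the deviation $\delta X = X - \bar{x}$ satisfies

$$\delta X_{n+1} = \delta X^f_{n+1} - P^f_{n+1} H^T \left(HP^f_{n+1} H^T + R\right)^{-1} \left(H\delta X^f_{n+1} + \Xi_{n+1}\right),$$

which implies

$$\begin{aligned}
P_{n+1} &= P^f_{n+1} - 2P^f_{n+1} H^T \left(HP^f_{n+1} H^T + R\right)^{-1} HP^f_{n+1} \\
&\quad + P^f_{n+1} H^T \left(HP^f_{n+1} H^T + R\right)^{-1} R \left(HP^f_{n+1} H^T + R\right)^{-1} HP^f_{n+1} \\
&\quad + P^f_{n+1} H^T \left(HP^f_{n+1} H^T + R\right)^{-1} HP^f_{n+1} H^T \left(HP^f_{n+1} H^T + R\right)^{-1} HP^f_{n+1} \\
&= P^f_{n+1} - P^f_{n+1} H^T \left(HP^f_{n+1} H^T + R\right)^{-1} HP^f_{n+1}
\end{aligned}$$

for the update of the covariance matrix, which is also consistent with the Kalman update step for Gaussian random variables.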
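A sketch of the analysis step (3.42), with the exact mean and covariance replaced by their ensemble estimates, could look as follows. The array layout (one ensemble member per column), the function name and the scalar test problem are assumptions made for illustration.

```python
import numpy as np

def enkf_perturbed_obs(X_f, y_obs, H, R, rng):
    """EnKF analysis step (3.42); X_f has shape (N, M), one member per column."""
    N, M = X_f.shape
    dX = X_f - X_f.mean(axis=1, keepdims=True)
    P_f = dX @ dX.T / (M - 1)                # empirical covariance (3.30)
    S = H @ P_f @ H.T + R
    K = np.linalg.solve(S, H @ P_f).T        # Kalman gain P_f H^T S^{-1}
    # One perturbation Xi ~ N(0, R) per ensemble member, shape (K, M).
    Xi = rng.multivariate_normal(np.zeros(R.shape[0]), R, size=M).T
    innov = H @ X_f - y_obs[:, None] + Xi
    return X_f - K @ innov

# Illustrative scalar problem (assumption): prior N(0, 1), R = 1, y_0 = 2.
rng = np.random.default_rng(5)
X_f = rng.standard_normal((1, 20000))
X_a = enkf_perturbed_obs(X_f, np.array([2.0]), np.eye(1), np.eye(1), rng)
```

For this Gaussian test case the analysis ensemble should have mean close to 1 and variance close to 0.5, up to sampling error.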


Practical implementations of the EnKF with perturbed observations replace the exact mean and covariance matrix by ensemble based empirical estimates (3.28) and (3.30), respectively.

Alternatively, we can derive a transport map $T$ under the assumption of Gaussian prior and posterior distributions as follows. Using the empirical ensemble mean $\bar{x}$, we define ensemble deviations by $\delta x_i = x_i - \bar{x} \in \mathbb{R}^N$ and an associated ensemble deviation matrix $\delta X = (\delta x_1, \ldots, \delta x_M) \in \mathbb{R}^{N \times M}$. Using this notation, the empirical covariance matrix of the prior ensemble at $t_{n+1}$ is then given by

$$P^f_{n+1} = \frac{1}{M-1}\,\delta X^f_{n+1} \left(\delta X^f_{n+1}\right)^T.$$


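We next seek a matrix $S \in \mathbb{R}^{M \times M}$ such that

$$P^a_{n+1} = \frac{1}{M-1}\,\delta X^f_{n+1}\,SS^T \left(\delta X^f_{n+1}\right)^T,$$

where the rows of $S$ sum to zero in order to preserve the zero mean property of $\delta X_{n+1} = \delta X^f_{n+1} S$. Such matrices do exist (see, e.g. [15]), and give rise to the ensemble square root filters. More specifically, Kalman's update formula (3.40) for the posterior covariance matrix implies

$$\begin{aligned}
P^a &= \frac{1}{M-1}\,\delta X^f \left[I - \frac{1}{M-1}\left(\delta Y^f\right)^T \left[HP^f H^T + R\right]^{-1} \delta Y^f\right] \left(\delta X^f\right)^T \\
&= \frac{1}{M-1}\,\delta X^f\,SS^T \left(\delta X^f\right)^T,
\end{aligned}$$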
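where we have dropped the time index subscript and introduced the ensemble perturbations $\delta Y^f = H\delta X^f$ in observation space $\mathcal{Y}$. Recalling now the definition of a matrix square root from Section 1.3 and making use of the Sherman-Morrison-Woodbury formula [18], we find that

$$S = \left\{I - \frac{1}{M-1}\left(\delta Y^f\right)^T \left[HP^f H^T + R\right]^{-1} \delta Y^f\right\}^{1/2} = \left\{I + \frac{1}{M-1}\left(\delta Y^f\right)^T R^{-1}\,\delta Y^f\right\}^{-1/2}. \tag{3.43}$$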


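The complete ensemble update of an ensemble square root filter is then given by

$$x_i(t_{n+1}) = \bar{x}_{n+1} + \delta X^f_{n+1} S e_i, \tag{3.44}$$

where $e_i$ denotes the $i$th basis vector in $\mathbb{R}^M$ and

$$\bar{x}_{n+1} = \bar{x}^f_{n+1} - P^f_{n+1} H^T \left(HP^f_{n+1} H^T + R\right)^{-1} \left(H\bar{x}^f_{n+1} - y_0(t_{n+1})\right)$$

denotes the updated ensemble mean.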
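The ensemble square root update (3.43)-(3.44) can be sketched as below. Computing the symmetric inverse square root through an eigendecomposition is one possible choice and an assumption of this example, as are the function name and the array layout (one ensemble member per column).

```python
import numpy as np

def esrf_update(X_f, y_obs, H, R):
    """Ensemble square root filter update (3.44); X_f has shape (N, M)."""
    N, M = X_f.shape
    x_mean = X_f.mean(axis=1, keepdims=True)
    dX = X_f - x_mean                       # ensemble deviations
    dY = H @ dX                             # perturbations in observation space
    P_f = dX @ dX.T / (M - 1)
    # Updated ensemble mean from the Kalman formula (3.39).
    K = np.linalg.solve(H @ P_f @ H.T + R, H @ P_f).T
    x_a = x_mean - K @ (H @ x_mean - y_obs[:, None])
    # Transform (3.43): S = (I + dY^T R^{-1} dY / (M - 1))^{-1/2}.
    A = np.eye(M) + dY.T @ np.linalg.solve(R, dY) / (M - 1)
    lam, V = np.linalg.eigh(A)
    S = (V / np.sqrt(lam)) @ V.T
    return x_a + dX @ S

# Illustrative data (assumption): 3 state variables, 10 members, 2 observed.
rng = np.random.default_rng(0)
X_f = rng.standard_normal((3, 10))
H, R, y = np.eye(2, 3), 0.5 * np.eye(2), np.array([0.3, -0.2])
X_a = esrf_update(X_f, y, H, R)
```

Unlike the EnKF with perturbed observations, this update is deterministic, and the empirical covariance of `X_a` matches (3.40) evaluated with the empirical prior covariance.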
We now discuss the update (3.44) from the perspective of optimal transportation which, in our context, reduces to finding a matrix $S_{OT} \in \mathbb{R}^{M \times M}$ such that the trace of

$$\operatorname{cov}\left(\delta X^f_{n+1}, \delta X_{n+1}\right) = E\left[\delta X^f_{n+1} \left(\delta X_{n+1}\right)^T\right]$$

is maximized.

Proposition 3.36 (Optimal update for ensemble square root filter). The trace of the covariance matrix $\operatorname{cov}(\delta X^f_{n+1}, \delta X_{n+1})$ is maximized for

$$\delta X_{n+1} = \delta X^f_{n+1} S_{OT}$$

with transform matrix


1 
Sor = = 
vM-1 
and S € R™*™ given by (3.43). 


T -1/2 T 
Se a E Na 


Proof. Follows from (3.9) with $A = \delta X^f_{n+1} S/\sqrt{M-1}$ and $\Sigma_1 = P^f_{n+1}$. The left multiplication in (3.8) is finally rewritten as a right multiplication by $S_{OT} \in \mathbb{R}^{M \times M}$ in terms of ensemble deviations $\delta X^f_{n+1}$.


We finish this section by briefly discussing a couple of practical issues. It is important to recall that the Kalman filter can be viewed as a linear minimum variance estimator [14]. At the same time, it has been noted [30, 55] that the updated ensemble mean $\bar{x}_{n+1}$ is biased if the prior distribution is not Gaussian. Hence, the associated mean squared error (3.10) does not vanish as $M \to \infty$ even though the variance of the estimator goes to zero. If desired, the bias can be removed by replacing $\bar{x}_{n+1}$ in (3.44) by (3.11) with weights (3.12), where $y_0 = y_0(t_{n+1})$ and $x_i^{prop} = x_i^f(t_{n+1})$. Higher-order moment corrections can also be implemented [30, 55]. However, the filter performance only improves for sufficiently large ensemble sizes.

We mention the unscented Kalman filter [23] as an alternative extension of the 
Kalman filter to nonlinear dynamical systems. We also mention the rank histogram 
filter [2], which is based on first constructing an approximative coupling in the ob- 
served variable y alone followed by linear regression of the updates in y onto the 
state space variable x. 

Practical implementations of EnKFs for high-dimensional problems rely on additional modifications, in particular, inflation and localization. While localization modifies the covariance matrix $P^f_{n+1}$ in the Kalman update (3.42) in order to increase its rank and to localize the spatial impact of observations in physical space, inflation increases the ensemble spread $\delta x_i = x_i - \bar{x}$ by replacing $x_i$ by $\bar{x} + \alpha(x_i - \bar{x})$ with $\alpha > 1$. Note that the second term on the right-hand side of (3.31) achieves a similar effect and ensemble inflation can be viewed as a simple parametrization of (stochastic) model errors. See [15] for more details on inflation and localization techniques.


3.4 Ensemble transform Kalman-Bucy filter


In this section, we describe an alternative implementation of ensemble square root filters based on the Kalman-Bucy filter. We first describe the Kalman-Bucy formulation of the linear filtering problem for Gaussian PDFs.

Proposition 3.37 (Kalman-Bucy equations). The Kalman update step (3.39)-(3.40) can be formulated as a differential equation in artificial time $s \in [0,1]$. The Kalman-Bucy equations are

$$\frac{d\bar{x}}{ds} = -PH^T R^{-1}(H\bar{x} - y_0)$$

and

$$\frac{dP}{ds} = -PH^T R^{-1} HP.$$

The initial conditions are $\bar{x}(0) = \bar{x}^f$ and $P(0) = P^f$, and the Kalman update is obtained from the final conditions $\bar{x}^a = \bar{x}(1)$ and $P^a = P(1)$.


Proof. We present the proof for $N = 1$ (one-dimensional state space) and $K = 1$ (a single observation). Under this assumption, the standard Kalman analysis step (3.39)-(3.40) gives rise to

$$P^a = \frac{P^f R}{P^f + R}, \qquad \bar{x}^a = \frac{\bar{x}^f R + y_0 P^f}{P^f + R},$$

for a given observation value $y_0$.

We now demonstrate that this update is equivalent to twice the application of a Kalman analysis step with $R$ replaced by $2R$. Specifically, we obtain

$$\hat{P}^a = \frac{2P_m R}{P_m + 2R}, \qquad P_m = \frac{2P^f R}{P^f + 2R}$$

for the resulting covariance matrix $\hat{P}^a$ with intermediate value $P_m$. The analyzed mean $\hat{x}^a$ is provided by

$$\hat{x}^a = \frac{2\bar{x}_m R + y_0 P_m}{P_m + 2R}, \qquad \bar{x}_m = \frac{2\bar{x}^f R + y_0 P^f}{P^f + 2R}.$$


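We need to demonstrate that $P^a = \hat{P}^a$ and $\bar{x}^a = \hat{x}^a$. We start with the covariance matrix and obtain

$$\hat{P}^a = \frac{\dfrac{4P^f R^2}{P^f + 2R}}{\dfrac{2P^f R}{P^f + 2R} + 2R} = \frac{4P^f R^2}{4P^f R + 4R^2} = \frac{P^f R}{P^f + R} = P^a.$$

A similar calculation for $\hat{x}^a$ yields

$$\hat{x}^a = \frac{2R\,\dfrac{2\bar{x}^f R + y_0 P^f}{P^f + 2R} + y_0\,\dfrac{2P^f R}{P^f + 2R}}{2R + \dfrac{2P^f R}{P^f + 2R}} = \frac{4\bar{x}^f R^2 + 4y_0 P^f R}{4R^2 + 4RP^f} = \bar{x}^a.$$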


Hence, by induction, we can replace the standard Kalman analysis step by $D \ge 2$ iterative applications of a Kalman analysis with $R$ replaced by $DR$. We set $P_0 = P^f$, $\bar{x}_0 = \bar{x}^f$, and iteratively compute $P_{j+1}$ and $\bar{x}_{j+1}$ from

$$P_{j+1} = \frac{DP_j R}{P_j + DR}, \qquad \bar{x}_{j+1} = \frac{D\bar{x}_j R + y_0 P_j}{P_j + DR}$$

for $j = 0,\ldots,D-1$. We finally set $P^a = P_D$ and $\bar{x}^a = \bar{x}_D$. Next, we introduce a step-size $\Delta s = 1/D$ and assume $D \gg 1$. Then,

$$\bar{x}_{j+1} = \frac{\bar{x}_j R + \Delta s\,y_0 P_j}{R + \Delta s\,P_j} = \bar{x}_j - \Delta s\,P_j R^{-1}(\bar{x}_j - y_0) + O(\Delta s^2)$$

as well as

$$P_{j+1} = \frac{P_j R}{R + \Delta s\,P_j} = P_j - \Delta s\,P_j R^{-1} P_j + O(\Delta s^2).$$

Taking the limit $\Delta s \to 0$, we obtain the two differential equations

$$\frac{dP}{ds} = -PR^{-1}P, \qquad \frac{d\bar{x}}{ds} = -PR^{-1}(\bar{x} - y_0)$$

for the covariance and mean, respectively. The equation for $P$ can be rewritten in terms of its square root $Y$ (i.e. $P = Y^2$) as

$$\frac{dY}{ds} = -\frac{1}{2}PR^{-1}Y. \tag{3.45}$$


Upon formally setting $Y = \delta X/\sqrt{M-1}$ in (3.45), the Kalman-Bucy filter equations give rise to a particular implementation of ensemble square root filters in terms of evolution equations in artificial time $s \in [0,1]$.
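The scalar argument used in the proof of Proposition 3.37 is easy to check numerically. The sketch below compares a single Kalman step, $D$ repeated steps with $R$ replaced by $DR$, and a forward Euler discretization of the Kalman-Bucy equations. Function names and parameter values are assumptions for illustration.

```python
def kalman_scalar(x_f, p_f, y_obs, r):
    """Scalar Kalman analysis step: returns (mean, variance)."""
    return (x_f * r + y_obs * p_f) / (p_f + r), p_f * r / (p_f + r)

def repeated_kalman_scalar(x_f, p_f, y_obs, r, D):
    """D Kalman steps with R replaced by D*R, as in the proof."""
    x, p = x_f, p_f
    for _ in range(D):
        x, p = kalman_scalar(x, p, y_obs, D * r)
    return x, p

def kalman_bucy_scalar(x_f, p_f, y_obs, r, n_steps):
    """Forward Euler for dx/ds = -p (x - y)/r, dp/ds = -p^2/r on [0, 1]."""
    x, p, ds = x_f, p_f, 1.0 / n_steps
    for _ in range(n_steps):
        x, p = x - ds * p * (x - y_obs) / r, p - ds * p * p / r
    return x, p

# Illustrative values (assumption): prior N(0, 1), R = 1, observation 2.
x_a, p_a = kalman_scalar(0.0, 1.0, 2.0, 1.0)
x_rep, p_rep = repeated_kalman_scalar(0.0, 1.0, 2.0, 1.0, D=7)
x_kb, p_kb = kalman_bucy_scalar(0.0, 1.0, 2.0, 1.0, n_steps=1000)
```

The repeated update agrees with the single step up to round-off, whereas the Euler discretization agrees only up to an error of order $\Delta s$.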


Definition 3.38 (Ensemble transform Kalman-Bucy filter equations). The ensemble transform Kalman-Bucy filter equations [1, 5, 6] for the assimilation of an observation $y_0 = y_0(t_n)$ at $t_n$ are given by

$$\frac{dx_i}{ds} = -\frac{1}{2}PH^T R^{-1}\left(Hx_i + H\bar{x} - 2y_0(t_n)\right)$$

in terms of the ensemble members $x_i$, $i = 1,\ldots,M$, and are solved over a unit time interval in artificial time $s \in [0,1]$. Here, $P$ denotes the empirical covariance matrix (3.30) and $\bar{x}$ denotes the empirical mean (3.28) of the ensemble.
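A simple way to experiment with Definition 3.38 is to integrate the ensemble equations with forward Euler steps in artificial time. This sketch uses a fixed step-size and is not one of the time-stepping methods of [1]; the function name, array layout and test data are assumptions for illustration.

```python
import numpy as np

def etkbf_update(X_f, y_obs, H, R, n_steps=1000):
    """Integrate the ensemble transform Kalman-Bucy equations over s in [0, 1].

    X_f has shape (N, M) with one ensemble member per column.
    """
    X = X_f.copy()
    M = X.shape[1]
    ds = 1.0 / n_steps
    R_inv = np.linalg.inv(R)
    for _ in range(n_steps):
        x_mean = X.mean(axis=1, keepdims=True)
        dX = X - x_mean
        P = dX @ dX.T / (M - 1)  # empirical covariance (3.30)
        innov = H @ X + H @ x_mean - 2.0 * y_obs[:, None]
        X = X - 0.5 * ds * P @ H.T @ R_inv @ innov
    return X

# Illustrative data (assumption): 2 state variables, 20 members, first one observed.
rng = np.random.default_rng(6)
X_f = rng.standard_normal((2, 20))
H, R, y = np.array([[1.0, 0.0]]), np.array([[1.0]]), np.array([0.5])
X_a = etkbf_update(X_f, y, H, R)
```

Up to the time-stepping error, the mean and covariance of `X_a` should agree with the Kalman update (3.39)-(3.40) evaluated with the empirical prior statistics.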


The Kalman-Bucy equations are realizations of an underlying differential equation

$$\frac{dX}{ds} = -\frac{1}{2}PH^T R^{-1}\left(HX + H\bar{x} - 2y_0(t_n)\right) \tag{3.46}$$

in the random variable $X$ with mean

$$\bar{x} = E_X[x] = \int_{\mathbb{R}^N} x\,\pi_X\,dx$$

and covariance matrix

$$P = E_X\left[(x - \bar{x})(x - \bar{x})^T\right].$$

The associated evolution of the PDF $\pi_X$ (here assumed to be absolutely continuous) is given by Liouville's equation

$$\frac{\partial \pi_X}{\partial s} = -\nabla_x \cdot (\pi_X v) \tag{3.47}$$

with vector field

$$v(x) = -\frac{1}{2}PH^T R^{-1}\left(Hx + H\bar{x} - 2y_0(t_n)\right). \tag{3.48}$$

Recalling the earlier discussion of the Fokker-Planck equation in Section 2.2, we note that (3.47) with vector field (3.48) also has an interesting geometric structure.


Proposition 3.39 (Ensemble transform Kalman-Bucy equations as a gradient flow). The vector field (3.48) is equivalent to

$$v(x) = -P\nabla_x \frac{\delta F}{\delta \pi_X}$$

with potential

$$F(\pi_X) = \frac{1}{4}\int_{\mathbb{R}^N} (Hx - y_0(t_n))^T R^{-1} (Hx - y_0(t_n))\,\pi_X\,dx + \frac{1}{4}(H\bar{x} - y_0(t_n))^T R^{-1} (H\bar{x} - y_0(t_n)). \tag{3.49}$$

Liouville's equation (3.47) can be stated as

$$\frac{\partial \pi_X}{\partial s} = -\nabla_x \cdot (\pi_X v) = -\operatorname{grad}_{\pi_X} F(\pi_X),$$

where we have used $M = P$ in the definition of the gradient (3.25).


Proof. The result can be verified by direct calculation.


Nonlinear forward operators can be treated in this framework by replacing the potential (3.49) by, for example,

$$F(\pi_X) = \frac{1}{4}\int_{\mathbb{R}^N} (h(x) - y_0(t_n))^T R^{-1} (h(x) - y_0(t_n))\,\pi_X\,dx + \frac{1}{4}(h(\bar{x}) - y_0(t_n))^T R^{-1} (h(\bar{x}) - y_0(t_n)).$$

Efficient time-stepping methods for the ensemble transform Kalman-Bucy filter equations are discussed in [1] and an application to continuous data assimilation can be found in [6].


3.5 Guided sequential Monte Carlo methods


EnKF techniques are limited by the fact that the empirical PDFs do not converge to the filter solution in the limit of ensemble sizes $M \to \infty$ unless the involved PDFs are Gaussian. Sequential Monte Carlo methods, on the other hand, can be shown to converge under fairly general assumptions, but they do not work well in high-dimensional phase spaces since importance sampling is not sufficient to guarantee good performance of a particle filter for finite ensemble sizes. In particular, the variance in the associated mean squared error (3.10) can be very large for ensemble sizes typically used in geophysical applications.

The combination of modified particle positions and appropriately adjusted par- 
ticle weights appears therefore as a promising area for research and might achieve 
a better bias-variance trade-off than either the EnKF or traditional sequential Monte 
Carlo methods. In particular, combining ensemble transform techniques, such as 
EnKF, with sequential Monte Carlo methods appears as a natural research direction. 
Indeed, in the framework of Monte Carlo methods discussed in Section 1.4, the standard
sequential Monte Carlo approach consists of importance sampling using the proposal PDF
$\pi_{X'}(x) = \pi_{n+1}(x|Y_n)$ and subsequent reweighting of particles according
to (3.12). Also, as discussed in Section 1.4, the performance of importance sampling
can be improved by applying modified proposal densities with the aim of pushing the
updated ensemble members $x_i(t_{n+1})$ to regions of high and nearly equal probability
in the targeted posterior PDF $\pi_{n+1}(x|Y_{n+1})$ (compare with equation (3.14)). We call
the resulting filter algorithms guided sequential Monte Carlo methods.

More precisely, a guided sequential Monte Carlo method is defined by a conditional
proposal PDF $\hat\pi_{n+1}(x'|x, y_0(t_{n+1}))$ and an associated joint PDF

$$
\pi_{X'X}(x', x|Y_{n+1}) = \hat\pi_{n+1}(x'|x, y_0(t_{n+1}))\, \pi_n(x|Y_n). \tag{3.50}
$$

An ideal proposal density (in the sense of coupling) should be identical to the posterior
PDF $\pi_{n+1}(x'|Y_{n+1})$. In guided sequential Monte Carlo methods, a mismatch between
the proposal density and $\pi_{n+1}(x'|Y_{n+1})$ is treated by adjusted particle weights
$w_i(t_{n+1})$. Following the general methodology of importance sampling, one obtains
the recursion

$$
w_i(t_{n+1}) \propto w_i(t_n)\,
\frac{\pi_Y(y_0(t_{n+1})|x_i')\, \pi(x_i'|x_i)}{\hat\pi_{n+1}(x_i'|x_i, y_0(t_{n+1}))} .
$$


Here, $\pi(x'|x)$ denotes the conditional PDF (3.18) describing the model dynamics,
$(x_i', x_i)$, $i = 1,\ldots,M$, are realizations from the joint PDF (3.50) with weights $w_i(t_n)$,
$x_i = x_i(t_n)$, and the approximation

$$
\mathbb{E}_{X_{n+1}}[g]
= \frac{1}{\pi_Y(y_0(t_{n+1})|Y_n)} \int f(x', x)\, \pi_{X'X}(x', x|Y_{n+1})\, dx'\, dx
\approx \frac{1}{\pi_Y(y_0(t_{n+1})|Y_n)} \sum_{i=1}^{M} w_i(t_n)\, f(x_i', x_i)
\propto \sum_{i=1}^{M} w_i(t_{n+1})\, g(x_i')
$$

with

$$
f(x', x) = g(x')\,
\frac{\pi_Y(y_0(t_{n+1})|x')\, \pi(x'|x)}{\hat\pi_{n+1}(x'|x, y_0(t_{n+1}))}
$$

has been used. The guided sequential Monte Carlo method is continued with
$x_i(t_{n+1}) = x_i'$ and new weights $w_i(t_{n+1})$.
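
The weight recursion can be illustrated with a minimal Python sketch. It assumes a scalar linear Gaussian toy model that is not taken from the text (the parameters `a`, `q`, `r` and all function names are hypothetical) and uses as proposal the exact conditional PDF of $x'$ given $x$ and the observation. For this particular choice, the incremental weight reduces to the Gaussian density of the observation with mean $a x_i$ and variance $q + r$, so that the sampled positions $x_i'$ add no further variability to the weights.

```python
import numpy as np


def gauss_pdf(z, mean, var):
    """Density of N(mean, var) evaluated at z (scalars or arrays)."""
    return np.exp(-0.5 * (z - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)


def guided_smc_step(x, w, y_obs, a, q, r, rng):
    """One guided step for the toy model x' = a*x + N(0, q), y = x' + N(0, r)."""
    # Proposal pi_hat(x'|x, y): here the exact conditional of x' given x and y.
    s2 = 1.0 / (1.0 / q + 1.0 / r)
    m = s2 * (a * x / q + y_obs / r)
    x_new = m + np.sqrt(s2) * rng.standard_normal(x.shape)

    # Incremental weight: likelihood * model transition / proposal.
    incr = (gauss_pdf(y_obs, x_new, r) * gauss_pdf(x_new, a * x, q)
            / gauss_pdf(x_new, m, s2))
    w_new = w * incr
    return x_new, w_new / w_new.sum(), incr


rng = np.random.default_rng(0)
M = 200
x = rng.standard_normal(M)            # particles x_i(t_n)
w = np.full(M, 1.0 / M)               # weights w_i(t_n)
a, q, r, y_obs = 0.9, 0.5, 0.2, 1.3   # hypothetical model parameters and datum

x_new, w_new, incr = guided_smc_step(x, w, y_obs, a, q, r, rng)
ess = 1.0 / np.sum(w_new ** 2)        # effective sample size
posterior_mean = np.sum(w_new * x_new)
```

Other proposals only require changing the three lines that define `m`, `s2` and the proposal density; the weight formula itself stays the same.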

Numerical implementations of guided sequential Monte Carlo methods have been 
discussed, for example, in [7, 11, 28, 36]. More specifically, a combined particle and 
Kalman filter is proposed in [28] to achieve almost equal particle weights (see also 
the discussion in [7]), while in [11, 36], new particle positions $x_i(t_{n+1})$ are defined
by means of implicit equations. We emphasize that both implementation approaches 
give up the requirement of unbiased estimation in hopes of reduced variance at finite 
ensemble sizes and hence for an overall reduction of the associated mean squared 
error (3.10). See also [45] for a discussion of guided sequential Monte Carlo methods 
from a coupling and transport perspective. 

Another broad class of methods is based on Gaussian mixture approximations to
the prior PDF $\pi_{n+1}(x|Y_n)$. Provided that the forward operator $h$ is linear, the
posterior PDF $\pi_{n+1}(x|Y_{n+1})$ is then also a Gaussian mixture and several procedures
have been proposed to adjust the proposals $x_i^f(t_{n+1})$ such that the adjusted
$x_i(t_{n+1})$ approximately follow the posterior Gaussian mixture PDF; see, for example,
[16, 49, 50]. Broadly speaking, these methods can be understood as providing approximate
transport maps $T'_{n+1}$ instead of an exact transport map $T_{n+1}$ in (3.36). However,
none of these methods avoids the need for particle reweighting and resampling. Recall that
resampling can be implemented such that it corresponds to a nondeterministic optimal
transference plan.

The following section is devoted to an embedding technique for constructing accurate
approximations to the transport map $T_{n+1}$ in (3.36).




3.6 Continuous ensemble transform filter formulations 


The implementation of (3.36) requires the computation of a transport map $T$. Optimal
transportation (i.e. maximizing the covariance of the transference plan) leads to
$T = \nabla_x \psi$, and the potential $\psi$ satisfies the highly nonlinear, elliptic
Monge-Ampère equation

$$
\pi_{X_2}(\nabla_x \psi)\, |D \nabla_x \psi| = \pi_{X_1} .
$$

A direct numerical implementation for high-dimensional state spaces $\mathcal{X} = \mathbb{R}^N$
seems to be presently out of reach. Instead, in this section, we utilize an embedding method
due to Moser [38], replacing the optimal transport map by a suboptimal transport
map which is defined as the time-one flow map of a differential equation in artificial
time $s \in [0, 1]$. At each time instant, determining the right-hand side of the differential
equation requires the solution of a linear elliptic PDE; nonlinearity is exchanged
for linearity at the cost of suboptimality. In some cases, such as Gaussian PDFs and
mixtures of Gaussians, the linear PDE can be solved analytically. In other cases, further
approximations, for example, the mean field approach discussed later in this
section, are necessary.

Inspired by the embedding method of Moser [38], we first summarize a dynamical
systems formulation [43] of Bayes' formula which generalizes the continuous EnKF
formulation from Section 3.4. We note that a single application of Bayes' formula
(3.4) can be replaced by a $D$-fold recursive application of the incremental likelihood
$\hat\pi$:

$$
\hat\pi(y|x) = \frac{1}{(2\pi)^{K/2} |R|^{1/2}}
\exp\left( -\frac{1}{2D} (h(x) - y)^T R^{-1} (h(x) - y) \right), \tag{3.51}
$$

i.e. we first write Bayes' formula as

$$
\pi_X(x|y_0) \propto \pi_X(x) \prod_{j=1}^{D} \hat\pi(y_0|x),
$$

where the constant of proportionality depends only on $y_0$, and then consider the
implied iteration

$$
\pi_{j+1}(x) = \frac{\pi_j(x)\, \hat\pi(y_0|x)}{\int_{\mathbb{R}^N} dx\, \pi_j(x)\, \hat\pi(y_0|x)}
$$


with $\pi_0 = \pi_X$ and $\pi_X(\cdot|y_0) = \pi_D$. We may now expand the exponential function in
(3.51) in the small parameter $\Delta s = 1/D$ in the limit $D \to \infty$, obtaining the evolution
equation

$$
\frac{\partial \pi}{\partial s} = -\frac{1}{2} (h(x) - y_0)^T R^{-1} (h(x) - y_0)\, \pi + \mu \pi \tag{3.52}
$$

in the fictitious time $s \in [0,1]$. The scalar Lagrange multiplier $\mu$ is equal to the
expectation value of the negative log likelihood function

$$
L(x; y_0) = \frac{1}{2} (h(x) - y_0)^T R^{-1} (h(x) - y_0) \tag{3.53}
$$

with respect to $\pi$ and ensures that $\int_{\mathbb{R}^N} (\partial \pi / \partial s)\, dx = 0$.
We also set $\pi(x, 0) = \pi_X(x)$ and obtain $\pi_X(x|y_0) = \pi(x, 1)$.
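
The equivalence between a single Bayes update, its $D$-fold incremental version and the evolution equation (3.52) can be checked numerically. The following Python sketch does this on a one-dimensional grid; the grid, the Gaussian prior, the identity forward operator and the values of `y0` and `R` are illustrative assumptions, and (3.52) is integrated with a plain explicit Euler scheme of step size $\Delta s = 1/D$.

```python
import numpy as np

# Hypothetical one-dimensional setting (not from the text).
x = np.linspace(-6.0, 6.0, 1201)
dx = x[1] - x[0]
y0, R = 1.0, 1.0


def h(x):
    """Forward operator; the identity is an assumption made for this sketch."""
    return x


prior = np.exp(-0.5 * x ** 2)
prior /= prior.sum() * dx
L = 0.5 * (h(x) - y0) ** 2 / R          # negative log likelihood (3.53)

# Single application of Bayes' formula.
posterior = prior * np.exp(-L)
posterior /= posterior.sum() * dx

D = 2000
pi_rec = prior.copy()                   # D-fold incremental likelihood
pi_euler = prior.copy()                 # explicit Euler for (3.52)
for _ in range(D):
    pi_rec = pi_rec * np.exp(-L / D)
    pi_rec /= pi_rec.sum() * dx
    mu = np.sum(L * pi_euler) * dx      # Lagrange multiplier = E[L]
    pi_euler = pi_euler + (1.0 / D) * (-L * pi_euler + mu * pi_euler)

err_rec = np.sum(np.abs(pi_rec - posterior)) * dx
err_euler = np.sum(np.abs(pi_euler - posterior)) * dx
```

The recursion reproduces the posterior up to rounding, while the Euler solution carries a discretization error that shrinks as `D` grows.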
We now rewrite (3.52) in the equivalent, but more compact, form

$$
\frac{\partial \pi}{\partial s} = -\pi\, (L - \bar{L}), \tag{3.54}
$$

where $\bar{L} = \mathbb{E}_X[L]$ and $\mathbb{E}_X$ denotes expectation with respect to the PDF
$\pi_X = \pi(\cdot, s)$. It should be noted that the continuous embedding defined by (3.54) is not
unique. Moser [38], for example, used the linear interpolation

$$
\pi(x, s) = (1 - s)\, \pi_X(x) + s\, \pi_X(x|y_0),
$$

which results in

$$
\frac{\partial \pi}{\partial s} = \pi_X(x|y_0) - \pi_X(x). \tag{3.55}
$$


Yet another interpolation is given by the displacement interpolation of McCann which 
is based on the optimal transportation map and which has an attractive “fluid dynam- 
ics” interpretation [52, 53]. 

Equation (3.54) (or, alternatively, (3.55)) defines the change (or transport) of the
PDF $\pi$ in fictitious time $s \in [0,1]$. Alternatively, following Moser's work [38, 52], we
can view this change as being induced by a continuity (Liouville) equation

$$
\frac{\partial \pi}{\partial s} = -\nabla_x \cdot (\pi g) \tag{3.56}
$$

for an appropriate vector field $g(x, s) \in \mathbb{R}^N$.

At any time $s \in [0, 1]$, the vector field $g(\cdot, s)$ is not uniquely determined by (3.54)
and (3.56) unless we also require that it is the minimizer of the kinetic energy

$$
\mathcal{T}(v) = \frac{1}{2} \int_{\mathbb{R}^N} \pi\, v^T M^{-1} v \, dx
$$

over all admissible vector fields $v : \mathbb{R}^N \to \mathbb{R}^N$ (i.e. $v$ satisfies (3.56) for
given $\pi$ and $\partial \pi / \partial s$), where $M \in \mathbb{R}^{N \times N}$ is a positive
definite matrix. Under these assumptions, minimization of the functional

$$
\mathcal{L}[v, \psi] = \frac{1}{2} \int_{\mathbb{R}^N} \pi\, v^T M^{-1} v \, dx
+ \int_{\mathbb{R}^N} \psi \left\{ \frac{\partial \pi}{\partial s} + \nabla_x \cdot (\pi v) \right\} dx
$$

for given $\pi$ and $\partial \pi / \partial s$ leads to the Euler-Lagrange equations

$$
\pi M^{-1} g - \pi \nabla_x \psi = 0, \qquad
\frac{\partial \pi}{\partial s} + \nabla_x \cdot (\pi g) = 0
$$

in the velocity field $g$ and the potential $\psi$. Hence, provided that $\pi > 0$, the desired
vector field is given by $g = M \nabla_x \psi$, and we have shown the following result.
Proposition 3.40 (Transport map from gradient flow). If the potential $\psi(x, s)$ is the
solution of the elliptic PDE

$$
\nabla_x \cdot (\pi_X M \nabla_x \psi) = \pi_X (L - \bar{L}), \tag{3.57}
$$

then the desired transport map $x' = T(x)$ for the random variable $X$ with PDF
$\pi_X(x, s)$ is defined by the time-one flow map of the differential equations

$$
\frac{dx}{ds} = -M \nabla_x \psi .
$$

The continuous Kalman-Bucy filter equations correspond to the special case $M = P$
and $\psi = \delta F / \delta \pi_X$ with the functional $F$ given by (3.49).


The elliptic PDE (3.57) can be solved analytically for Gaussian approximations to
the PDF $\pi_X$ and the resulting differential equations are equivalent to the ensemble
transform Kalman-Bucy equations (3.46). Appropriate analytic expressions can also
be found in the case where $\pi_X$ can be approximated by a Gaussian mixture and the
forward operator $h(x)$ is linear (see [44] for details).

Gaussian mixtures are contained in the class of kernel smoothers. It should, however,
be noted that approximating a PDF $\pi_X$ over high-dimensional phase spaces
$\mathcal{X} = \mathbb{R}^N$ using kernel smoothers is a challenging task, especially if only a
relatively small number of realizations $x_i$, $i = 1,\ldots,M$, from the associated random
variable $X$ are available.

In order to overcome this curse of dimensionality, we outline a modification to
the above continuous formulation, which is inspired by the rank histogram filter of
Anderson [2]. For simplicity of exposition, consider a single observation $y \in \mathbb{R}$ with
forward operator $h : \mathbb{R}^N \to \mathbb{R}$. We augment the state vector $x \in \mathbb{R}^N$ by
$y = h(x)$, i.e. we consider $(x, y)$ and introduce the associated joint PDF

$$
\pi_{XY}(x, y) = \pi_X(x|y)\, \pi_Y(y) .
$$

We apply the embedding technique first to $y$ alone, resulting in

$$
\frac{dy}{ds} = f_Y(y, s)
$$

with

$$
\partial_y \left( \pi_Y(y) f_Y(y) \right) = \pi_Y(y) (L - \bar{L}) .
$$

One then finds an equation in the state variable $x \in \mathbb{R}^N$ from

$$
\nabla_x \cdot \left( \pi_X(x|y) f_X(x, y, s) \right) + f_Y(y, s)\, \partial_y \pi_X(x|y) = 0
$$

and

$$
\frac{dx}{ds} = f_X(x, y, s) .
$$
Next, we introduce the mean field approximation

$$
\pi_1(x^1|y)\, \pi_2(x^2|y) \cdots \pi_N(x^N|y) \tag{3.58}
$$

for the conditional PDF $\pi_X(x|y)$ with the components of the state vector written
as $x = (x^1, x^2, \ldots, x^N)^T \in \mathbb{R}^N$. Under the mean field approximation, the vector
field $f_X = (f_{X^1}, f_{X^2}, \ldots, f_{X^N})^T$ can be obtained component-wise by solving scalar
equations

$$
\partial_z \left( \pi_k(z|y) f_{X^k}(z, y) \right) + f_Y(y)\, \partial_y \pi_k(z|y) = 0, \tag{3.59}
$$

$k = 1,\ldots,N$, for $f_{X^k}(z, y)$ with $z = x^k \in \mathbb{R}$. The (two-dimensional) conditional
PDFs $\pi_k(x^k|y)$ need to be estimated from the available ensemble members $x_i \in \mathbb{R}^N$
by either using parametric or nonparametric statistics.

We first discuss the case for which both the prior and the posterior distributions
are assumed to be Gaussian. In this case, the resulting update equations in $x \in \mathbb{R}^N$
become equivalent to the ensemble transform Kalman-Bucy filter. This can be seen
by first noting that the update in a scalar observable $y \in \mathbb{R}$ is

$$
\frac{dy}{ds} = -\frac{1}{2} \sigma_{yy}^2 R^{-1} (y + \bar{y} - 2 y_0) .
$$

Furthermore, if the conditional PDF $\pi_k(z|y)$, $z = x^k \in \mathbb{R}$, is of the form (3.1), then
(3.59) leads to

$$
f_{X^k}(x^k, y) = \sigma_{xy}^2 \sigma_{yy}^{-2} f_Y(y),
$$

which, combined with the approximation (3.58), results in the continuous ensemble
transform Kalman-Bucy filter formulation discussed previously.
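
The scalar update in the observable can be tested directly on an ensemble. The sketch below, written in Python under the assumption that the variance and mean in the update are replaced by their empirical ensemble values, integrates the ODE over $s \in [0,1]$ with a classical Runge-Kutta scheme and compares the result with the Kalman formulas for the posterior mean and variance. The ensemble size, `y_obs` and `R` are hypothetical values chosen for illustration.

```python
import numpy as np


def kalman_bucy_update(y, y_obs, R, n_steps=1000):
    """Integrate dy/ds = -0.5 * var * (y + mean - 2*y_obs) / R over s in [0, 1]."""
    y = np.array(y, dtype=float)
    ds = 1.0 / n_steps

    def rhs(z):
        # Empirical variance and mean stand in for the exact moments.
        return -0.5 * np.var(z, ddof=1) / R * (z + z.mean() - 2.0 * y_obs)

    for _ in range(n_steps):            # classical fourth-order Runge-Kutta
        k1 = rhs(y)
        k2 = rhs(y + 0.5 * ds * k1)
        k3 = rhs(y + 0.5 * ds * k2)
        k4 = rhs(y + ds * k3)
        y = y + ds * (k1 + 2.0 * k2 + 2.0 * k3 + k4) / 6.0
    return y


rng = np.random.default_rng(3)
y_prior = 2.0 + 1.5 * rng.standard_normal(50)   # hypothetical prior ensemble
y_obs, R = 0.5, 0.8

y_post = kalman_bucy_update(y_prior, y_obs, R)

# Reference values from the Kalman formulas with empirical prior moments.
m0, v0 = y_prior.mean(), np.var(y_prior, ddof=1)
v1 = 1.0 / (1.0 / v0 + 1.0 / R)
m1 = v1 * (m0 / v0 + y_obs / R)
```

Because all ensemble deviations are scaled by the same factor, the agreement with the Kalman formulas holds for the empirical moments of any ensemble, not only for Gaussian samples.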

The rank histogram filter of Anderson [2] corresponds in this continuous embedding
formulation to choosing a general PDF $\pi_Y(y)$, while a Gaussian approximation
is used for the conditional PDFs $\pi_k(x^k|y)$.

Other ensemble transform filters can be derived by using appropriate approximations
to the marginal PDF $\pi_Y$ and the conditional PDFs $\pi_k(x^k|y)$, $k = 1,\ldots,N$, from
the available ensemble members $x_i$, $i = 1,\ldots,M$.

References 


An excellent introduction to filtering and Bayesian data assimilation is [22]. The linear 
filter theory (Kalman filter) can, for example, be found in [48]. Fundamental issues 
of data assimilation in a meteorological context are covered in [25]. Ensemble filter 
techniques and the ensemble Kalman filter are treated in depth in [15]. Sequential 
Monte Carlo methods are discussed in [3, 4, 13] and by [7, 27] in a geophysical context. 
See also the recent monograph [19]. The transport view has been proposed in [12] for 
continuous filter problems and in [43] for intermittent data assimilation. Gaussian 
mixtures are a special class of nonparametric kernel smoothing techniques which are 
discussed, for example, in [54]. 


132 —— Sebastian Reich and Colin J. Cotter 


4 Concluding remarks 


We have summarized the Bayesian perspective on sequential data assimilation and 
filtering in particular. Special emphasis has been put on discussing Bayes’ formula in 
the context of coupling of random variables, which allows for a dynamical system’s 
interpretation of the data assimilation step. Within a Bayesian framework, all vari- 
ables are treated as random. While this implies an elegant mathematical treatment of 
data assimilation problems, any Bayesian approach should be treated with caution 
in the presence of sparse data, high-dimensional model problems, and limited sam- 
ple sizes. It should be noted in this context that successful assimilation techniques 
such as 4DVar (not covered in this survey) and the EnKF lead to biased approxima- 
tions to the state estimation problem. In both cases, the bias is due to the fact that the 
algorithms are derived under the assumption that the prior distributions are Gaus- 
sian. Nevertheless, 4DVar and EnKF often work well in terms of the observed mean 
squared error (3.10) since the variance of the estimator remains small, even for rel- 
atively small ensemble sizes M. On the contrary, asymptotically unbiased Bayesian 
approaches such as sequential Monte Carlo methods suffer from the curse of dimen- 
sionality, generally lead to large variances in the estimators for small M and have 
therefore not yet found systematic applications in operational forecasting, for exam- 
ple. To overcome this limitation, one could consider more suitable proposal steps 
such as guided sequential Monte Carlo methods and/or impose certain independence 
assumptions such as mean field approximations which lead to an improved balance 
between bias and variance in the mean squared error (3.10). See also the discussion 
of [20] on the bias-variance trade-off in the context of supervised learning. Promising 
results for guided particle filters have been reported very recently in [29, 35]. Alterna- 
tively, non-Bayesian approaches to data assimilation could be explored in the future, 
for example, (i) shadowing for partially observed reference solutions, (ii) a nonlinear 
control approach with transport maps as dynamic feedback laws, and (iii) derivation 
and analysis of ensemble filter techniques within the framework of stochastic inter- 
acting particle systems. 
Abstract: This chapter provides an overview of inverse problems in imaging, with
a particular focus on biomedical imaging applications and current developments. We 
discuss some basics in the mathematical modeling of images, image reconstruction, 
and imaging devices. Then, we proceed to three topics of high current interest, name- 
ly, problems with missing data as appearing in inpainting or imaging from surface 
measurements, nonlinear inverse problems created by the need to perform additional 
calibrations, and finally high-dimensional inverse problems in dynamic imaging. 


Keywords: Inverse problems, imaging, image reconstruction, inpainting, blind decon- 
volution, dynamic imaging 


2010 Mathematics Subject Classification: 65N21, 35R20, 92C55, 65M32 


Martin Burger: Institute for Computational and Applied Mathematics, University of Münster, 
Einsteinstrasse 62, 48149 Münster, Germany, martin.burger@wwu.de 

Hendrik Dirks: Institute for Computational and Applied Mathematics, University of Münster, 
Einsteinstrasse 62, 48149 Münster, Germany, hendrik.dirks@uni-muenster.de 

Jahn Müller: Institute for Computational and Applied Mathematics, University of Münster, 
Einsteinstrasse 62, 48149 Münster, Germany, jahn.mueller@uni-muenster.de 


Nowadays, life is hard to imagine without the use of images and videos, an increasing
fraction of which is in digital format. While humans conveyed a huge amount of
information via audio and audio-type signals (speeches, telephone, telegraphs, radio)
one hundred years ago, we increasingly use image and video-based methods now
(television, computers, internet). This also applies to many other parts of daily life,
engineering, medicine, and science. As examples, consider the transition from
stethoscopes to modern medical imaging devices or from ground-based meteorological
stations to satellite-based weather surveillance.

The increasing use of images and videos has led to a new branch of science, of- 
ten called imaging science, where mathematical methods play an important role. In- 
verse problems are an important part in this area since they arise at two fundamental 
points: 
• The way to the image: Most measurement devices are not able to automatically
  deliver high-quality images, but rather raw data from which images need to
  be reconstructed. Image reconstruction is a classical inverse problem; in particular,
  tomographic setups are currently widely studied with different mathematical
  techniques.

• The (quantitative) interpretation of the image: In many cases, images and videos
  are of interest due to the quantitative information they carry. In order to benefit
  from the latter, an appropriate link to mathematical models is needed which can
  also be cast in the framework of inverse problems.

and Klaus Schäfers (EIMI, Münster) for permission to use PET data, and Eric Schulze-Bahr
(Cardiology, Münster) for permission to use cardiac imaging data.


We will provide a brief overview of the mathematical issues arising from these two 
questions in this chapter with a focus on recent developments and open questions. 
Our aim is by no means to give an extensive overview of mathematical techniques 
nor of imaging devices and problems. Rather, we focus on variational methods which 
allow for a unified treatment of large classes of problems and build on a sound math- 
ematical background, and on certain classes of problems that we think can highly 
benefit from further development in inverse problems. We start with the basic math- 
ematical modeling of images and their properties in Section 1 and then proceed to 
some examples of imaging devices and related mathematical models in order to fur- 
ther motivate the subsequent investigations. In the subsequent Section 3, we review 
some classical mathematical problems in image reconstruction. Afterwards, we turn 
to three areas of high current interest: The first are effectively underdetermined prob- 
lems discussed in Section 4, where the appropriate incorporation of prior informa- 
tion becomes of ultimate importance. The second are problems usually arising in fast 
measurements, when the system parameters cannot be well calibrated and need to be 
estimated together with the image, as we will highlight in Section 5. Finally, we will 
discuss dynamic problems in imaging, i.e. related to videos, and the link to mathe- 
matical models for the dynamics in Section 6. 


1 Mathematical models for images 


Images can be modeled as densities or intensities (gray values) on an image domain
$\Omega \subset \mathbb{R}^d$, which means the image $u : \Omega \to \mathbb{R}$ is simply a nonnegative
function. Frequently, the density is not a priori a distribution of gray values, but directly a density
of some physical quantity carrying quantitative information, e.g. tracer substances 
in medical imaging. Thus, inverse problems dealing with images as unknowns can 
directly be related to the bulk of other inverse problems dealing with reconstructing 
functions. A major difference to the majority of such inverse problems is that images 
are not expected to be smooth functions, but have specific structures of particular 
importance: 
• Edges: A highly important part of images are edges, which are related mainly to
  discontinuities in the function $u$. The edges and the cartoon, i.e. a piecewise
  constant approximation of the image between the edges, are often the first quantities
  of interest in the interpretation of the image. Hence, it is of high importance that
  solution methods for inverse problems do not destroy edges.

• Textures: In natural images, these are small scale patterns, i.e. locally structured
  high-frequency information. Different patterns are usually separated by
  the edges, and thus the interplay with edges is crucial. Since the high-frequency
  information is highly damped by typical forward operators, it is often out of reach
  to reconstruct textures in inverse problems.

• Morphology: Images are frequently interpreted in a morphological way, i.e. the
  exact gray values are of limited interest, but rather the isocontours or the level
  sets of the image provide the relevant information.


It has become a standard setting to consider cartoon images as functions of
bounded variation $u \in BV(\Omega)$. Assuming a normalization of the image such as

$$
\int_\Omega u \, dx = 1, \tag{4.1}
$$

the cartoon should be characterized by a rather small total variation

$$
TV(u) = \sup_{g \in C_0^\infty(\Omega; \mathbb{R}^d),\ \|g\|_\infty \le 1} \int_\Omega u\, \nabla \cdot g \, dx. \tag{4.2}
$$


Note that for a piecewise constant function $u$, i.e. an image consisting of regions with
homogeneous gray values separated by sharp edges, the total variation is just the
perimeter of the jump set weighted by the jump height. Another description of the
cartoon comes from the work of Mumford and Shah [71], and was originally designed
for image segmentation. Their description consists of an edge set $\Gamma \subset \Omega$ and a smooth
component $u \in H^1(\Omega \setminus \Gamma)$. The corresponding functional that is thought to be small
for good cartoon images is of the form

$$
MS(u, \Gamma) = \int_{\Omega \setminus \Gamma} |\nabla u|^2 \, dx + \mathcal{H}^{d-1}(\Gamma), \tag{4.3}
$$

where $\mathcal{H}^{d-1}$ denotes the $(d-1)$-dimensional Hausdorff measure.
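
The statement that the total variation of a piecewise constant image equals the perimeter of the jump set weighted by the jump height can be checked on a pixel grid. The Python sketch below uses an anisotropic finite-difference discretization of the total variation with unit pixel size, which is an assumption of this example; for an axis-aligned square, this discretization reproduces the continuum value exactly.

```python
import numpy as np


def total_variation_aniso(u):
    """Anisotropic discrete TV: sum of absolute forward differences (unit pixels)."""
    horizontal = np.abs(np.diff(u, axis=1)).sum()
    vertical = np.abs(np.diff(u, axis=0)).sum()
    return horizontal + vertical


# Cartoon image: one square of constant gray value on a zero background.
side, height = 20, 3.0
u = np.zeros((64, 64))
u[10:10 + side, 15:15 + side] = height

tv = total_variation_aniso(u)
perimeter = 4 * side
```

For edges that are not aligned with the grid, the anisotropic value overestimates the perimeter, so an isotropic discretization would be the more natural choice there.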

The texture part is more difficult to characterize, as it is usually attributed to
oscillatory parts in the image, but consequently difficult to separate from potential
noise. In analogy to the cartoon part, Meyer [66] proposed a dual approach and tried
to characterize texture as parts $v$ with $\|v\|_*$ rather small, where for a distribution
$v \in BV(\Omega)^*$ with zero mean,

$$
\|v\|_* = \sup_{p \in BV(\Omega),\ TV(p) \le 1} \langle v, p \rangle. \tag{4.4}
$$
Other approaches are based on representations in negative Sobolev spaces ([78, 98]),
nonlocal versions of total variation or Sobolev norms exploiting similarities of patches,
which will be discussed below. While image decomposition into structure and texture
is a highly relevant problem in many parts of image processing, it is often of less
importance for inverse problems in imaging. The main reason is a smoothing property
of the forward operators, which usually strongly damp the high-frequency components.
Hence, in reconstructions of images, the focus is laid on the cartoon parts,
which is also a reason why total variation is a very popular penalty in variational
regularization methods (cf. [22] for a detailed discussion of total variation reconstruction
methods).

Several bases or frames such as wavelets, curvelets, or shearlets have been proposed
to efficiently represent images. They are based on multiscale decompositions,
usually in a dyadic rescaling of space. $\ell^1$-norms on the coefficients of such systems, in
particular on wavelet coefficients, induce norms on Besov spaces. A particularly well
studied case is the Besov space $B^1_{1,1}$, which is quite close to the space of functions of
bounded variation. Also, the wavelet approximation of total variation functionals has
been frequently studied (cf., e.g. [24, 34]).

A strong recent trend is the use of nonlocal approaches for images, motivated by the
nonlocal filter introduced by Buades and coworkers [19]. Roughly speaking, the idea is
to interpret an image not as a collection of single gray values, but as a collection of
local patches. A corresponding continuum model is to consider the image as a function
$U$ on $\Omega \times \Xi$, where $\Xi$ is a small neighborhood of the origin modeling a patch. The
consistency is obtained by $U(x, y) = U(x + y, 0)$ for all $y \in \Xi$. From this space
of patch-functions, a set of weights $w(x, \tilde{x})$ for $x, \tilde{x} \in \Omega$ is computed by comparing
patches, i.e. the functions $U(x, \cdot)$ and $U(\tilde{x}, \cdot)$. This yields a weighted graph structure
on the image which can be further analyzed ([19, 59, 85, 88]). One option is to use
discrete calculus on graphs to define analogues of total variation or other functionals for
these patch-functions ([48, 59, 85]). In particular, for natural images, such approaches
yield superior results in many tasks such as denoising since one can exploit that
similar patches appear several times within the image, e.g. in textures.
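
A small Python sketch may help to make the construction of patch-based weights concrete. It works on a one-dimensional signal for brevity and assumes Gaussian weights of the squared patch distance with a bandwidth `h` as well as periodic boundary handling; the text does not prescribe these choices, and the function name is hypothetical.

```python
import numpy as np


def patch_weights(u, half_width, h):
    """Weights w(x, x~) = exp(-||U(x, .) - U(x~, .)||^2 / h^2) for a 1D signal."""
    n = u.size
    offsets = np.arange(-half_width, half_width + 1)
    # Row i holds the patch U(x_i, .); indices wrap around periodically.
    idx = (np.arange(n)[:, None] + offsets[None, :]) % n
    patches = u[idx]
    dist2 = ((patches[:, None, :] - patches[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-dist2 / h ** 2)


# Texture-like signal with period 4: identical patches recur every 4 samples.
u = np.tile(np.array([0.0, 0.0, 1.0, 1.0]), 8)
W = patch_weights(u, half_width=1, h=0.5)
```

Samples that lie one period apart receive the maximal weight although they are not neighbors in space, which is exactly the redundancy that nonlocal methods exploit.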

In most areas of imaging, in particular those related to inverse problems where
one does not just play with given images, variational approaches (respectively Bayesian
methods with particular focus on MAP estimation) have become a standard tool.
There are two natural functionals involved, namely, the fidelity $D(u, f)$ (which can
be interpreted as the negative log-likelihood of obtaining the data $f$ conditioned on
the image $u$) and the regularization functional $R(u)$. A standard solution approach
is the minimization of the energy functional

$$
E(u) = \lambda D(u, f) + R(u), \tag{4.5}
$$

with a weighting parameter $\lambda > 0$. Clearly, such approaches are an equivalent
formulation to Tikhonov-type regularization in inverse problems, with regularization
parameter $\alpha = 1/\lambda$. We will discuss the detailed modeling of prior information and the
relation to Bayesian models below.
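
As a minimal illustration of minimizing an energy of the form (4.5), the following Python sketch denoises a one-dimensional signal. It assumes a quadratic fidelity $D(u, f) = \frac{1}{2}\|u - f\|^2$ and a smoothed total variation as regularization $R(u)$; the smoothing parameter `eps`, the step size rule and plain gradient descent are choices made for this example only and are not taken from the text.

```python
import numpy as np


def energy(u, f, lam, eps):
    """E(u) = lam * 0.5 * ||u - f||^2 + sum sqrt((u_{i+1} - u_i)^2 + eps^2)."""
    du = np.diff(u)
    return lam * 0.5 * np.sum((u - f) ** 2) + np.sum(np.sqrt(du ** 2 + eps ** 2))


def energy_grad(u, f, lam, eps):
    du = np.diff(u)
    p = du / np.sqrt(du ** 2 + eps ** 2)   # derivative of the smoothed |.|
    grad_reg = np.zeros_like(u)
    grad_reg[:-1] -= p                     # contribution to u_i
    grad_reg[1:] += p                      # contribution to u_{i+1}
    return lam * (u - f) + grad_reg


def minimize_energy(f, lam=5.0, eps=0.1, n_iter=300):
    # lam + 4/eps bounds the Lipschitz constant of the gradient.
    step = 1.0 / (lam + 4.0 / eps)
    u = f.copy()
    history = [energy(u, f, lam, eps)]
    for _ in range(n_iter):
        u = u - step * energy_grad(u, f, lam, eps)
        history.append(energy(u, f, lam, eps))
    return u, np.array(history)


rng = np.random.default_rng(0)
clean = np.concatenate([np.zeros(40), np.ones(40), 0.3 * np.ones(40)])
f = clean + 0.1 * rng.standard_normal(clean.size)   # noisy data
u, history = minimize_energy(f)
```

Larger values of `lam` keep the result closer to the data, while smaller values favor a flatter, more cartoon-like signal.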


2 Examples of imaging devices 


In the following, we give a brief overview of the most frequently used types of de- 
vices for acquiring imaging data. We focus on the basic structures and implications 
for mathematical modeling and inverse problems rather than on the detailed device 
physics and specific application context. 


2.1 Optical imaging 


Optical imaging based on recording photons (usually with CCD devices nowadays)
is probably the most intuitive way of obtaining image data and is used in various
digital camera systems (from microscopes and hand-held systems to high-end movie
cameras and astronomical telescopes). In this case, one can interpret the recording
directly as an image by translating the number of photons (possibly at different
wavelengths) into a grayscale (or a color scale). Effects to be taken into account in
forward models are certain factors that can lead to a convolution (e.g. defocus) or make
it necessary to investigate the structure of the noise (e.g. low light intensity).

Active research in optical imaging is still related to denoising and deconvolution, 
also in the version of blind deconvolution as we shall discuss below. Since, in many 
cases, the recorded images or image sequences themselves are of reasonable quality, 
most research is rather related to processing digital images and videos. Another quite 
active field is to acquire three-dimensional information from stereo or other multi- 
camera systems. 

The applications of optical imaging are ubiquitous as digital images and videos 
are part of almost everyone’s daily life in the modern world. In addition to usual op- 
tical frequencies, an increasing number of devices use other or larger parts of the fre- 
quency band of electromagnetic waves. In particular, multi- and hyperspectral imag- 
ing is a strong trend since it can provide much better information than just the usual 
three primary colors we can distinguish. 


2.2 Transmission tomography 


In transmission tomography, rays (X-rays, electrons) are sent through the object from
different positions and their attenuation is recorded on the opposite side. The classical
forward model is the Radon transform, i.e. the line integrals of the object density,
since the attenuation is proportional to the density along the line. The principle of
transmission tomography is illustrated by micro-CT data in Figure 4.1, with two
projections from different sides and the final 3D image reconstruction.

Figure 4.1: Illustration of transmission tomography: Micro-CT imaging of a child's toy. Top row: two
projections from different angles. Bottom row: 3D image reconstruction (threshold segmented).
Data courtesy of European Institute for Molecular Imaging and SFB 656, Münster.
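
The forward model of transmission tomography can be sketched in a few lines of Python. The example below computes two parallel-beam projections of a small phantom as sums along the array axes and assumes the usual exponential (Beer-Lambert) attenuation law to relate line integrals and detected intensities; the phantom, the incoming intensity `I0` and the function name are hypothetical. Projections at other angles would require rotating or interpolating the image and are omitted.

```python
import numpy as np


def parallel_projection(mu, axis, pixel_size=1.0):
    """Line integrals of mu along straight rays parallel to the given array axis."""
    return mu.sum(axis=axis) * pixel_size


# Hypothetical phantom: attenuation per pixel, with a denser inclusion.
mu = np.zeros((32, 32))
mu[8:24, 12:20] = 0.05
mu[12:16, 14:18] = 0.10

p_vertical = parallel_projection(mu, axis=0)     # rays running along the rows
p_horizontal = parallel_projection(mu, axis=1)   # rays running along the columns

# Assumed exponential attenuation of an incoming intensity I0 along each ray.
I0 = 1.0e4
detected = I0 * np.exp(-p_vertical)
recovered = -np.log(detected / I0)               # back to line integrals
```

Each projection integrates out one coordinate, so both projections preserve the total mass of the phantom while losing the information about the position along the ray.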

Tomography is a very well studied mathematical topic (cf., e.g. [73, 74]). Current 
challenges are related to exact reconstruction formulas for three-dimensional scan- 
ning geometries and problems with limited data, e.g. limited angles which are severe- 
ly ill-posed in contrast to the full data case. 

Major applications of transmission tomography are in medicine and material test- 
ing. In physics and biology, there is increasing interest in electron tomography for 
visualizing three-dimensional structures at the nanoscale ([43, 60]). 


2.3 Emission tomography 


Emission tomography techniques such as positron emission tomography (PET) and single
photon emission computed tomography (SPECT) are based on recording photons emitted
during the radioactive decay of some tracer inside the body. Since the radioactive
decay is random, the forward models for emission tomography naturally need to be of
stochastic nature. In PET, one uses tracers whose decay produces photons traveling in
opposite directions, and thus to each recorded coincidence of photons, one can attribute
a decay event on the line in between. The tracers used in SPECT only emit a single photon
in a random direction and one uses collimators to get information about the direction,
that is, the line on which the decay event has taken place. Since clearly the probability
of a decay event is proportional to the tracer density along the corresponding interval,
one obtains a stochastic sampling of the Radon transform in both cases. It is quite standard
to use the Poisson distribution as a model for the randomness of the decay. Subtle
differences between PET and SPECT are in the detailed modeling of attenuation. We
will come back to the issues arising in SPECT below.

Figure 4.2: Illustration of emission tomography: Reconstruction of a cardiac PET scan in two
different slice views. Data courtesy of SFB 656, Münster (subproject C1).
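
The Poisson model for emission data can be simulated as follows. This Python sketch assumes a single parallel projection as a stand-in for the Radon transform and a hypothetical factor `scan_factor` that lumps together scan time and detector sensitivity; it also evaluates the Poisson negative log-likelihood (up to terms independent of the expected counts), which is the natural fidelity for such data.

```python
import numpy as np


def poisson_nll(expected, counts, tiny=1e-12):
    """Poisson negative log-likelihood, dropping terms independent of `expected`."""
    return np.sum(expected - counts * np.log(expected + tiny))


rng = np.random.default_rng(1)

# Hypothetical tracer distribution and one parallel projection of it.
tracer = np.zeros((32, 32))
tracer[10:22, 10:22] = 1.0
line_integrals = tracer.sum(axis=0)

scan_factor = 5.0                       # assumed scan time times sensitivity
expected = scan_factor * line_integrals
counts = rng.poisson(expected)          # recorded decay events per detector bin

nll_at_truth = poisson_nll(expected, counts)
nll_at_counts = poisson_nll(counts.astype(float), counts)
```

Bins that are not crossed by any tracer record no events, and the relative noise level drops as `scan_factor` increases, which reflects the dependence of image quality on acquisition time.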

The major application of emission tomography is nuclear medicine. Due to their 
potential of monitoring time-dependent physiological processes (in a quite specific 
way when using appropriate tracers), these techniques have received increasing at- 
tention within in vivo imaging. The trade-off between spatial resolution and speci- 
ficity in emission tomography is illustrated in Figure 4.2 via a cardiac scan clearly 
displaying the concentration of the tracer in the ventricles. 


Figure 4.3: Illustration of MR imaging: Reconstruction of a cardiac MR scan of the same subject as in 
Figure 4.2, in two different slice views (top) and 3D visualization (bottom) of two different time 
frames. Data courtesy of SFB 656, Münster (subproject C1). 

2.4 MR imaging


Magnetic Resonance (MR) imaging is probably the technique with the most complex 
physics, though on the other hand, it yields the least ill-posed reconstruction prob- 
lem. The principle of MR is to induce nuclear magnetic relaxation of protons or cer- 
tain molecules by applying magnetic fields. The measurements consist of electrical 
signals at the same frequency. 

The usual forward model in MR is related to the Fourier transform, and hence the 
inverse problem of reconstructing the image from full data is well-posed. However, the 
acquisition in slices is rather time-consuming, and thus a main challenge is to make 
MR faster in order to obtain high resolution images with increasing time resolution. 
The current image quality in MR is illustrated by a cardiac MR scan in Figure 4.3. 
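
The well-posedness of the full-data problem and the effect of faster, incomplete acquisition can be illustrated with the discrete Fourier transform. The Python sketch below idealizes the MR forward model as a two-dimensional unitary FFT of a hypothetical test image and simulates accelerated acquisition by keeping only every second line of Fourier data and filling the rest with zeros; both the image and the sampling pattern are assumptions of this example.

```python
import numpy as np

# Hypothetical test image: a bright square on a dark background.
image = np.zeros((64, 64))
image[10:30, 20:40] = 1.0

# Idealized forward model: unitary 2D discrete Fourier transform.
kspace = np.fft.fft2(image, norm="ortho")

# Full data: inversion is stable, since the transform preserves norms.
recon_full = np.fft.ifft2(kspace, norm="ortho").real

# Accelerated acquisition: keep every second line, zero-fill the others.
mask = np.zeros(image.shape, dtype=bool)
mask[::2, :] = True
recon_zero_filled = np.fft.ifft2(np.where(mask, kspace, 0.0), norm="ortho").real

aliasing_error = np.abs(recon_zero_filled - image).max()
```

The zero-filled reconstruction shows a shifted copy of the square superimposed on the image, which is why undersampled acquisitions call for additional prior information.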

The major application of MR is nowadays a medical one. Due to the absence of 
radiation exposure compared to X-ray or radioactivity based techniques, MR can be 
frequently used for various tasks. Functional versions of MR scans play a prominent 
(and recently strongly debated) role in neurosciences. 


2.5 Acoustic imaging 


In acoustic imaging, like in ultrasound or seismic data acquisition, an acoustic wave 
is usually sent into the body or earth, and the echo is recorded on the surface. The nat- 
ural forward model is the wave equation with a spatially varying wave speed, and the 
inverse problem is to reconstruct the wave speed. Mainly for computational reasons, 
frequency domain formulations or several approximations (e.g. Eikonal equations or 
first arrival data) of the wave equation have been used in the past. 

While inversion is quite frequently used in geophysics, it is used less in medical
ultrasound since one can usually interpret the data directly. In the latter, image processing
and automatic image analysis techniques are of interest, particularly for dealing
with the large speckle noise artifacts appearing in such image sequences ([93]).
A lot of recent interest in inversion has been in the novel hybrid technique of photoacoustic
imaging, where the acoustic wave is induced by an optical laser ([100]).


Figure 4.4: Illustration of acoustic imaging: Cardiac echo scan of the same subject as in Figure 4.2.
Left: closed mitral valve during systolic phase. Right: opened mitral valve during diastolic phase.
Data courtesy of SFB 656, Münster (subproject C1).

A major advantage of ultrasound imaging is the inherently high time resolution,
which allows one to monitor relevant processes such as the heartbeat. Together with
speckle artifacts, this is illustrated in Figure 4.4.


2.6 Electromagnetic imaging 


In electromagnetic imaging, one records electrical potentials or magnetic fields created
by currents inside or on the boundary of the object, using arrays of electrodes on the
object surface or magnetometers at a small distance from the surface. The forward models
are clearly the Maxwell equations or, in many cases, reductions to the Poisson equation (and the
Biot-Savart law for the magnetic field). The term imaging in such applications is often
debated since the inverse problems are severely ill-posed and the reconstructions are
hence of limited quality and mainly restricted to low frequency components.

Due to the high remaining challenges in electromagnetic imaging, this is a very 
active field of research in applied mathematics, particularly in inverse problems. 
In pure surface imaging of electrical activity created inside the object, a major 
challenge is the appropriate modeling of prior information to decrease or elimi- 
nate the nonuniqueness of the reconstruction problem. The technology of electrical 
impedance tomography ([33]), where different currents between the electrodes are 
sent into the body and the resulting potentials are measured, has received enormous attention
in inverse problems; the underlying mathematical problem is known as the Calderón problem ([25]). The
inversion can be formulated as reconstructing the conductivity in the Poisson equa- 
tion from the knowledge of the Dirichlet-to-Neumann (or Neumann-to-Dirichlet) map, 
which has a rich mathematical structure. 

Besides material testing and geology, applications of electromagnetic imaging 
have been found recently in medicine, e.g. in brain (EEG/MEG), heart (ECG/MCG) or 
muscle studies (EMG). The technique of EIT is mainly applied in material testing and 
in monitoring lung activity. Also, hybrid techniques find increasing attention ([4, 64]). 


3 Basic image reconstruction 


The first fundamental step is the reconstruction of images from raw data. The most 
prominent image reconstruction problem nowadays is related to X-ray tomography, 
which is based on inverting the Radon transform as we shall recall below. In optical 
imaging devices like photography, telescopes, or microscopy, one obtains an image 
directly, but it quite frequently suffers from defects or does not yet have the desired
resolution. Here, the reconstruction step can also be interpreted as a correction step,
e.g. of defocus or atmospheric blur.

In a canonical mathematical formulation, classical image reconstruction can be 
formulated as the solution of a linear operator equation 


Ku=f, (4.6) 


with given (noisy) data f and a usually compact forward operator K. Due to the non- 
closed range of compact operators, most image reconstruction problems become ill- 
posed problems in the sense of Hadamard ([42]). We shall discuss some standard is- 
sues in the following. 


3.1 Deblurring and point spread functions 


A standard problem in imaging is blur, e.g. caused by lack of focus or motion. The 
mathematical model for blurring is an integral operator of the form 


$$ Ku(x) = \int_\Omega k(x - y, y)\, u(y)\, dy, \tag{4.7} $$


where often the Point Spread Function (PSF) k is approximated as spatially indepen- 
dent, i.e. k only depends on the first variable. 

In many cases, blur can be approximated well by a Gaussian due to the following 
two reasons: On the one hand, blur is caused by diffusion-type processes, and the 
solution of the diffusion equation is just a convolution with a Gaussian. On the other 
hand, blur is sometimes caused by repeated random processes, and the central limit 
theorem again leads to a Gaussian PSF. For these reasons, Gaussians are also routine- 
ly used as PSFs in many tests of reconstruction algorithms and as first approximations 
for many devices. In such cases, only the variance of the Gaussian (often translated 
into the full-width-at-half-maximum) has to be determined, which is frequently possi- 
ble using phantom measurements. 
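
As a minimal sketch of this parametrization (not taken from the original text), the following
Python/NumPy snippet converts a measured full-width-at-half-maximum into the standard deviation
of the Gaussian and samples a normalized PSF on a pixel grid. The function names
`fwhm_to_sigma` and `gaussian_psf`, the square grid, and the unit pixel spacing are
illustrative assumptions.

```python
import numpy as np


def fwhm_to_sigma(fwhm):
    """Standard deviation of a Gaussian with the given full width at half maximum."""
    return fwhm / (2.0 * np.sqrt(2.0 * np.log(2.0)))


def gaussian_psf(size, fwhm):
    """Sample an isotropic Gaussian PSF on a size x size pixel grid (unit spacing).

    The kernel is centered on the grid and normalized to sum to one, so that
    blurring with it preserves the total intensity of the image.
    """
    sigma = fwhm_to_sigma(fwhm)
    ax = np.arange(size) - (size - 1) / 2.0      # pixel coordinates relative to the center
    xx, yy = np.meshgrid(ax, ax)
    psf = np.exp(-(xx**2 + yy**2) / (2.0 * sigma**2))
    return psf / psf.sum()
```

For instance, `gaussian_psf(9, 4.0)` yields a 9 x 9 kernel whose value two pixels away from
the center (half of the FWHM) is half of the peak value.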

Another recent trend for many imaging devices is an experimental determination 
of the PSF. In such tests, very small objects (i.e. images $u_z$ with very small support
around a point z) are used, and since these approximate a Dirac delta at z under
appropriate rescaling, one obtains via


$$ c\, K u_z(x) \approx \int_\Omega k(x - y, y)\, \delta(z - y)\, dy = k(x - z, z) \tag{4.8} $$


an approximate read-out of the PSF from the corresponding measurements. Although 
this is a purely experimental procedure, it creates an interesting mathematical prob- 
lem due to the fact that such sources cannot be placed at an arbitrary number of 
positions in the device due to costs, time consumption, limited precision in placing 


146 — Martin Burger, Hendrik Dirks and Jahn Müller 


the sources, or other issues. In practice, one obtains a rather sparse sampling of the 
PSF and thus, the problem of PSF interpolation occurs ([7, 11, 14, 46, 62, 63, 101, 106]). 


3.2 Noise 


The modeling of noise in imaging is an interesting issue and taking into account the 
statistics of noise can indeed yield significantly improved reconstructions in many 
cases. In some inherently stochastic forward problems such as emission tomography, 
where photons are created by random radioactive decay, the modeling of noise has 
a rather long tradition (cf., e.g. [97]). In other problems, noise modeling and its use in 
reconstruction algorithms has become a very active field of research in the last years 
(cf., e.g. [10, 93]). In particular, the form of the noise has consequences for the form 
of the data likelihood and thus on the appropriate modeling of variational or iterative 
reconstruction methods. 
A frequently used standard model for the noise is additive Gaussian noise, i.e.


$$ f|_D = g|_D + \sigma n_D \tag{4.9} $$


for each detector D, with exact data g = Ku, $\sigma > 0$, and independent normally
distributed $n_D$. Clearly, this yields a Gaussian distribution of the noise, and the
corresponding negative log-likelihood is of the form


$$ L_\sigma(Ku|f) = \frac{1}{2\sigma^2} \sum_D \left( f|_D - Ku|_D \right)^2. \tag{4.10} $$


Asymptotically, the negative log-likelihood converges to the squared $L^2$-norm

$$ D(u, f) = L(Ku|f) = \frac{1}{2\sigma^2} \int (Ku - f)^2 \, dx. \tag{4.11} $$


In imaging devices based on photon counts, different noise statistics are in place. 
A standard model is a Poisson distribution for the counts, i.e. the number of counts 
per detector D is a Poisson-distributed random variable with mean value $Ku|_D$. Here
(by adding terms independent of u), the data term can be written as the Kullback-Leibler
divergence


$$ D(u, f) = \int \left( f \log\frac{f}{Ku} - f + Ku \right) dx. \tag{4.12} $$


In the case of good statistics, i.e. a high number of counts, the Poisson distribution
can be approximated by a Gaussian distribution with the same mean and variance (both
equal to Ku in the Poisson model). Thus, one obtains


$$ D(u, f) = \frac{1}{2} \int \frac{(Ku - f)^2}{Ku} \, dx. \tag{4.13} $$

Since this model is convex but still nonquadratic in u, frequently a further approximation
based on the reasoning $Ku \approx f$ for the denominator is used, i.e.


$$ D(u, f) \approx \frac{1}{2} \int \frac{(Ku - f)^2}{f} \, dx, \tag{4.14} $$


which can also be interpreted as a second-order Taylor expansion of the Kullback-Leibler
divergence in Ku around f.
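
The relation between the Poisson data term and its weighted quadratic approximation can be
checked numerically. The following Python/NumPy lines are a minimal sketch under the
assumption of discrete data (sums instead of integrals) and strictly positive f and Ku;
the function names are illustrative and not part of the original text.

```python
import numpy as np


def kl_data_term(Ku, f):
    """Discrete Kullback-Leibler data term: sum of f*log(f/Ku) - f + Ku.

    Uses the convention 0*log(0) = 0 and assumes Ku > 0 wherever f > 0.
    """
    Ku = np.asarray(Ku, dtype=float)
    f = np.asarray(f, dtype=float)
    log_term = np.zeros_like(f)
    pos = f > 0
    log_term[pos] = f[pos] * np.log(f[pos] / Ku[pos])
    return float(np.sum(log_term - f + Ku))


def weighted_l2_data_term(Ku, f):
    """Second-order Taylor approximation of the KL term in Ku around f (needs f > 0)."""
    Ku = np.asarray(Ku, dtype=float)
    f = np.asarray(f, dtype=float)
    return float(0.5 * np.sum((Ku - f) ** 2 / f))
```

For Ku close to f, both values nearly coincide, while for Ku = f both vanish; the
difference grows with the relative deviation of Ku from f.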

Recently, a variety of different noise models have been investigated. This con- 
cerns variants of the salt-and-pepper noise ([32]), other multiplicative models (cf., 
e.g. [89]), and Rayleigh-type distributions for modeling speckle noise as, e.g. appear- 
ing in ultrasound ([55, 99]). 


3.3 Reconstruction methods 


Various reconstruction methods have been proposed over the last decades for dif- 
ferent tasks of imaging. There is a first distinction between direct and iterative re- 
construction methods. Direct reconstructions rely on exact formulas for the inverse 
operator of K and a numerical implementation of these. A standard example is the
Radon transform, which can be inverted exactly using the Fourier transform, and efficient
numerical implementations can be obtained using FFT techniques. Due to the
ill-posedness of the inverse problem, using the noisy data directly in the inversion usually
does not work; the data have to be filtered first. This leads to


$$ u = K^{-1} F_\alpha(f), \tag{4.15} $$


where $F_\alpha$ is a filtering operation with parameter $\alpha$ (which is a regularization
parameter for the inverse problem in the sense of mollification methods, cf. [72]). Currently,
linear filters are mainly used, which are easy to implement and analyze, though in 
principle, one can think of using nonlinear filters as well. 
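
To illustrate the structure of (4.15), the following Python/NumPy lines sketch a filtered
direct inversion for the simplest case of a periodic one-dimensional convolution, where
the exact inverse is a division in the Fourier domain. The spectral cut-off used for the
filter is only one possible linear filter, chosen here for illustration; the function
name and the cut-off rule are assumptions and not taken from the original text.

```python
import numpy as np


def filtered_deconvolution(f, kernel, alpha):
    """Filtered direct inversion u = K^{-1} F_alpha(f) for periodic convolution K.

    K is diagonalized by the FFT, so K^{-1} is a pointwise division by the
    Fourier coefficients of the kernel. The linear filter F_alpha removes all
    frequencies whose kernel coefficient has modulus at most alpha, which
    prevents the amplification of noise in those components.
    """
    k_hat = np.fft.fft(kernel)
    f_hat = np.fft.fft(f)
    keep = np.abs(k_hat) > alpha            # frequencies passed by the filter
    u_hat = np.zeros_like(f_hat)
    u_hat[keep] = f_hat[keep] / k_hat[keep]
    return np.real(np.fft.ifft(u_hat))
```

A larger value of alpha removes more frequencies and hence yields a smoother but less
detailed reconstruction, which reflects the role of alpha as a regularization parameter.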

Iterative reconstruction methods are usually based on a variational formulation. 
In the unregularized case, i.e. for the minimization of the functional $u \mapsto D(u, f)$,
one uses appropriate early termination of the iteration to obtain optimal results. Examples
are the simple descent method


$$ u^{k+1} = u^k - \tau\, \partial_u D(u^k, f), \tag{4.16} $$


and for nonnegative image restoration, positivity preserving schemes like the EM-type
algorithm

$$ u^{k+1} = u^k - \tau\, u^k\, \partial_u D(u^k, f) \tag{4.17} $$


for appropriate $\tau$, which is specifically used for Poisson noise models as


$$ u^{k+1} = \frac{u^k}{K^* 1}\, K^* \left( \frac{f}{K u^k} \right). \tag{4.18} $$

This corresponds to (4.17) with the Kullback-Leibler data term (4.12), for which
$\partial_u D(u, f) = K^* 1 - K^*(f / Ku)$, and the choice $\tau = 1 / K^* 1$.


Figure 4.5: Effect of noise and iteration number in iterative regularization methods illustrated by EM
iterations on the cardiac PET software phantom XCAT (simulated data courtesy of European Institute
for Molecular Imaging, Münster). First row: 5 iterations (left) and 10 iterations (right). Second row: 15
iterations (left) and 20 iterations (right). Third row: 60 iterations (left) and 100 iterations (right).

Figure 4.5 illustrates the so-called semiconvergence properties of such iterative
methods, here for EM with Poisson noise. The iterates first approach a good
reconstruction, but with too many iterations, noise effects enter the reconstruction
again and degrade the image quality. In the case of additional regularization, one minimizes
a functional D(u, f) + R(u) instead, where R is a regularization functional.
In the following, we shall always consider cases with appropriate regularization R, 
which prevents the instability in case of noisy data. The properties of the noise are en- 
coded in the specific form of D (as discussed in the previous section) and the weight- 
ing between data fidelity and regularization. Depending on the specific form of R, 
various iterative methods to compute a minimizer have been proposed, e.g. based on 
splitting and augmented Lagrangian methods ([23, 35, 49, 104, 107]). 
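
The EM iteration (4.18) discussed above can be sketched in a few lines for a discrete
forward operator. The following Python/NumPy code is a minimal illustration, assuming
that K is a matrix with nonnegative entries and positive column sums and that Ku stays
positive during the iteration; the function name and the constant initial value are
assumptions made for this example.

```python
import numpy as np


def em_reconstruction(K, f, n_iter, u0=None):
    """EM iteration u <- u / (K^T 1) * K^T (f / (K u)) for Poisson data f.

    K: matrix with nonnegative entries (discrete forward operator).
    f: nonnegative data vector.
    n_iter: number of iterations; for noisy data this acts as the
            regularization parameter (early termination).
    """
    K = np.asarray(K, dtype=float)
    f = np.asarray(f, dtype=float)
    sensitivity = K.T @ np.ones(K.shape[0])          # K^* 1
    u = np.ones(K.shape[1]) if u0 is None else np.array(u0, dtype=float)
    for _ in range(n_iter):
        u = u / sensitivity * (K.T @ (f / (K @ u)))  # multiplicative update keeps u >= 0
    return u
```

Since the update is multiplicative, a positive initial value leads to nonnegative iterates
without any projection step. For noisy data, the number of iterations has to be chosen
with the semiconvergence behavior in mind.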


4 Missing data and prior information 


A popular trend in recent years is to consider image reconstruction with missing da- 
ta which is related to the main line of research in compressed sensing ([38, 41]). Ill- 
posedness in such problems is created in the sense of nonuniqueness of the solution 
in the inverse problem rather than by instability. The key idea to obtain good solu- 
tions to the inverse problem is to incorporate prior information. Again, variational 
methods are a standard approach and we shall discuss some favorable properties as 
well as some current limitations. 


4.1 Prior information 


Several kinds of prior information have been used to improve image reconstruction
and, in particular, to enable meaningful reconstruction also with missing data. Nowadays,
the standard way of modeling prior information is (at least at a formal level)
Bayesian modeling. Its basis is Bayes' theorem, which yields the posterior probability
density of u being the underlying image given the data f as


$$ p(u|f) = \frac{p(f|u)\, p(u)}{p(f)}. \tag{4.19} $$


Here, p(f|u) is the likelihood of the data given the image u, and p(u) respectively
p(f) are prior probabilities for the image and data. Since f is fixed, the latter is only
a scaling factor and of no particular importance. The interesting part is the prior
probability of the image, which can encode relevant prior information.

Several estimates can be obtained from the posterior distribution. The most frequently
used and most straightforward one to compute is the Maximum a posteriori
Probability (MAP) estimate given by


$$ \hat{u} = \arg\max_u\, p(u|f). \tag{4.20} $$


Using the equivalent minimization of the negative logarithm, we find that MAP estimation
can be formulated as Tikhonov-type regularization of an inverse problem,
namely,

$$ \hat{u} \in \arg\min_u \left( D(u, f) + R(u) \right), \tag{4.21} $$


where $R(u) = -\log p(u)$ takes the role of the regularization functional.

While Gaussian priors, respectively quadratic regularization functionals, were 
very popular for many years due to their computational and analytical simplicity, 
a different paradigm has evolved, particularly in the last decades. In many instances, 
it was found that $\ell^1$-type regularization functionals, i.e. Laplace prior distributions,
yield superior properties. The first such approach was the ROF model for image denoising
([83]), which used the total variation as a regularization. Total variation has
a nice geometric meaning via the coarea formula. For sufficiently regular BV functions
([44]), we have


$$ TV(u) = \int_{\mathbb{R}} \mathcal{H}^{d-1}\left( \partial\{u < \alpha\} \right) d\alpha, \tag{4.22} $$


where $\mathcal{H}^{d-1}$ denotes the (d - 1)-dimensional Hausdorff measure. Hence, the total
variation penalizes the surface area of the level sets of the image. One observes that this
is also true for functions u with discontinuities, as long as the discontinuity set is
Hausdorff-measurable. The latter is a key property of total variation models. While
regularizations based on standard Sobolev-type norms do not allow one to obtain re- 
constructions with discontinuities, i.e. images with edges, the total variation model 
can realize reconstructions with realistic edges. 
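
The geometric meaning of the coarea formula can be made concrete with a small discrete
experiment. The sketch below uses an anisotropic finite-difference discretization of the
total variation, which is an assumption made for simplicity (it coincides with the
perimeter for axis-aligned rectangles); the function name is illustrative.

```python
import numpy as np


def anisotropic_tv(u):
    """Discrete anisotropic total variation: sum of absolute forward differences."""
    u = np.asarray(u, dtype=float)
    return float(np.abs(np.diff(u, axis=0)).sum() + np.abs(np.diff(u, axis=1)).sum())


# Indicator image of a 4 x 6 rectangle: every level set for 0 < alpha < 1 is the
# rectangle itself, so the total variation equals its perimeter 2 * (4 + 6).
image = np.zeros((10, 10))
image[3:7, 2:8] = 1.0
perimeter = anisotropic_tv(image)

# Scaling the jump height by h scales the range of levels alpha, and hence TV, by h.
scaled = anisotropic_tv(2.5 * image)
```

The jump across the edge is penalized only through its height and the length of the
edge, not through its sharpness, which is why discontinuous images have finite total
variation.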

A popular alternative to MAP estimates are conditional mean (CM) estimates giv- 
en by 


$$ \hat{u} = \int u\, p(u|f)\, du. \tag{4.23} $$


A major difficulty for CM estimates is their efficient computation, since
a high-dimensional integration problem needs to be solved. The standard approach
is to use Markov chain Monte Carlo methods. We refer to [27, 56, 57] for further details of
those and, in general, Bayesian inversion. In the following, we shall focus on MAP 
estimation, namely, the corresponding variational model (4.21). 

There are several ways of using prior information. In a rough classification, we 
can distinguish three approaches: 
(1) General structure information, e.g. geometrical information such as a lack of os- 
cillations and smoothness between reasonable edge sets as modeled by total varia- 
tion and Mumford-Shah minimization. Related approaches are based on looking for 
sparse representations in wavelet, curvelet, shearlet, or similar frame systems. Such 
approaches are frequently used since prior knowledge is intuitive and quite minimal. 
(2) Available dictionaries of “typical” solutions. Dictionaries are learned such that 
some kind of sparse representation of the image in terms of the dictionary can be 
expected. There are two standard approaches to sparsity in dictionaries, namely,
synthesis-based and analysis-based ([39, 82]). In both cases, one frequently uses
convex relaxations, respectively log-concave prior distributions like Laplace distributions
on the coefficients. If a suitable dictionary is available, such approaches can be
extremely efficient, but a major trouble, particularly in inverse problems, is that a dic- 
tionary of possible solutions is hardly accessible. In medical applications, a further 
concern about such approaches is the danger that the available dictionary does not 
include the pathologies arising in certain patients which might then be eliminated in 
the image reconstructions. 

(3) Available prior information of the local content in an additional dimension. This 
could be possible time dynamics or spectral signatures in each pixel. Again, there is 
some kind of sparsity prior that can be used here. Given a dictionary of the typical 
local content, it is natural to assume that in each pixel, there is only a mixture (mod- 
eled as linear combination) of few elements. For example, in hyperspectral images, 
one can assume that in each pixel there is only a combination of a few different mate- 
rials, and hence, a sparse mixture of material spectral signatures. From this example, 
one can also understand that the sparsity should be expected to increase with spatial 
resolution. 


In variational models of the form (4.21), it is usually beneficial to use convex functionals
R for theoretical purposes as well as to avoid computational difficulties in
computing global minimizers. Thus, sparsity priors are usually relaxed from minimizing
the number of nonzero coefficients (the so-called $\ell^0$-norm) to minimizing the
$\ell^1$-norm of the coefficients. Since this step is often made in an ad hoc fashion in the
literature, we give a simple explanation of why the $\ell^1$-norm is a reasonable relaxation
in the following. To this end, consider the case of an operator K acting on $\ell^1(I)$ with
a finite or countable index set I and assume further that some upper bound $C_i$ on the
modulus of each element $u_i$ is available, which is reasonable in most applications. Thus,
we can formulate the sparsity minimization problem as a mixed integer programming problem
of the form


$$ D(Ku, f) + \sum_{i \in I} p_i \to \min_{u \in \ell^1(I),\; p_i \in \{0, 1\}} \tag{4.24} $$


subject to the constraints

$$ |u_i| \le C_i, \qquad (1 - p_i)\, u_i = 0. \tag{4.25} $$

One observes that this constrained minimization can be equivalently rewritten as
(4.24) subject to

$$ |u_i| \le C_i\, p_i, \tag{4.26} $$

since for $u_i \neq 0$, only $p_i = 1$ is possible and for $u_i = 0$, the minimizer will clearly
satisfy $p_i = 0$. The straightforward convex relaxation is to replace $p_i \in \{0, 1\}$ by
$p_i \in [0, 1]$. For the resulting problem


$$ D(Ku, f) + \sum_{i \in I} p_i \to \min_{u \in \ell^1(I),\; p_i \in [0, 1]} \tag{4.27} $$

subject to (4.26), the optimal value of $p_i$ can be computed as $p_i = |u_i| / C_i$. By eliminating
this variable, we end up with minimizing the weighted $\ell^1$-regularized problem
($w_i = 1 / C_i$)

$$ D(Ku, f) + \sum_{i \in I} w_i\, |u_i| \to \min_{u \in \ell^1(I),\; |u_i| \le C_i}. \tag{4.28} $$
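
The effect of the relaxation on the penalty term can be illustrated with a few lines of
Python/NumPy. This is a minimal sketch for a finite index set; the function names are
illustrative assumptions. It compares the number of nonzero coefficients with the value
of the relaxed penalty obtained for the optimal $p_i = |u_i| / C_i$.

```python
import numpy as np


def l0_penalty(u):
    """Number of nonzero coefficients, i.e. the penalty with binary p_i."""
    return int(np.count_nonzero(u))


def relaxed_penalty(u, C):
    """Weighted l1 penalty sum_i |u_i| / C_i, i.e. the penalty with optimal p_i in [0, 1]."""
    u = np.asarray(u, dtype=float)
    C = np.asarray(C, dtype=float)
    if np.any(np.abs(u) > C):
        raise ValueError("u violates the bound |u_i| <= C_i")
    return float(np.sum(np.abs(u) / C))
```

Since $|u_i| / C_i \le 1$ for every nonzero coefficient, the relaxed penalty never exceeds
the number of nonzero coefficients, and both agree exactly when every nonzero coefficient
attains its bound.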


Several recent concepts have been proposed for improvements, particularly hy- 
perprior frameworks where one has a regularization of the form R (u; p) depending 
on additional parameters p for which another prior distribution is available. The MAP
estimate then amounts to solving


$$ (\hat{u}, \hat{p}) \in \arg\min_{u, p} \left( D(u, f) + R(u; p) + S(p) \right). \tag{4.29} $$


A related example is the idea of inf-convolution, which represents the image as a sum 
of two parts for which separate prior information is available. The corresponding vari- 
ational model is of the form 


$$ (\hat{u}, \hat{v}) \in \arg\min_{u, v} \left( D(u, f) + R_1(u - v) + R_2(v) \right), \tag{4.30} $$


which has been of particular interest in total variation combined with a higher-order 
functional for smooth parts in the image ([15, 16, 87]). 

A systematic error of MAP estimates is to underestimate R (u), which, e.g. results 
in contrast losses in the case of total variation regularization. In order to cure this, 
the Bregman iteration has been proposed ([77]), which, instead of a single solution of 
a variational model, constructs a sequence 


$$ u^{k+1} \in \arg\min_u \left( \tau D(u, f) + R(u) - \langle p^k, u \rangle \right) \tag{4.31} $$


for $p^k \in \partial R(u^k)$ and small $\tau$. The behavior is that of an iterative regularization
method, and hence appropriate termination is necessary for optimal results.
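
The Bregman iteration (4.31) becomes explicit in the simplest setting of denoising with
K equal to the identity, the quadratic data term, and R equal to the $\ell^1$-norm. The
following Python/NumPy sketch is written under exactly these assumptions (they are chosen
for illustration and are not the setting of the cited work); each step then reduces to
a soft-thresholding and an update of the subgradient.

```python
import numpy as np


def soft_threshold(x, t):
    """Componentwise soft-thresholding, the minimizer of 0.5*(u - x)^2 + t*|u|."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)


def bregman_l1_denoise(f, tau, n_iter):
    """Bregman iteration for D(u, f) = 0.5*||u - f||^2 and R(u) = ||u||_1.

    Each step minimizes tau*D(u, f) + R(u) - <p, u>, which is solved by
    soft-thresholding f + p/tau with threshold 1/tau. The optimality condition
    tau*(u - f) + p_new - p = 0 gives the new subgradient p_new in dR(u).
    Returns the list of iterates u^1, ..., u^{n_iter}.
    """
    f = np.asarray(f, dtype=float)
    p = np.zeros_like(f)              # p^0 = 0 is a subgradient of R at u^0 = 0
    iterates = []
    for _ in range(n_iter):
        u = soft_threshold(f + p / tau, 1.0 / tau)
        p = p - tau * (u - f)
        iterates.append(u.copy())
    return iterates
```

The first iterate is the usual soft-thresholded solution, which systematically
underestimates the magnitude of the coefficients. Subsequent iterates restore this
loss step by step and, for noisy data, would eventually reproduce the noise, so the
iteration has to be stopped appropriately.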


4.2 Undersampling and superresolution 


A frequently studied issue is the case of undersampled data, i.e. one tries to achieve 
a higher spatial resolution than the Shannon sampling theorem allows for the num- 
ber of measurements. Obviously, this is possible only with the use of strong prior 
information. Let S denote a sampling operator. Then, the inverse problem can be re- 
formulated as 

SKu=g =Sf. (4.32) 


In this case, the most important issue is not necessarily the noise (for a rather
low-dimensional range of S, the instability decreases and, e.g. the Moore-Penrose inverse
becomes continuous), but rather the null space. Hence, it is important to understand
how one can favor the type of solutions corresponding well to the prior knowledge.
To this end, it is of particular interest to study the problem


$$ R(u) \to \min_{u \text{ satisfying } (4.32)}. \tag{4.33} $$


Figure 4.6: Illustration of the impact of using prior knowledge in noisy image reconstruction on
a cardiac PET scan (measurement data courtesy of European Institute for Molecular Imaging,
Münster). Top row: EM-reconstruction with 20 minutes of data, i.e. low noise level, representing
ground truth (left) and EM-reconstruction with five seconds of data, i.e. high noise level (right).
Bottom row: Reconstruction from a variational model with TV-regularization (left) and improvement
by Bregman iteration (right), both with five seconds of data.

In the finite-dimensional sparsity case with R being the $\ell^1$-norm, the study of this
problem has led to celebrated results of compressed sensing. Under certain conditions
on the operator A = SK, which is then just a matrix with large null space, one
can indeed uniquely reconstruct very sparse solutions of (4.32) in this way, even for
the low-dimensional range of S. Classical conditions, by now, are low incoherence in
the matrix A, i.e. for two different rows $A_i$ and $A_j$ of A, one should have


$$ |A_i \cdot A_j| \ll 1, \tag{4.34} $$


and the celebrated restricted isometry property, which requires A to be close to 
an isometry on subspaces of k-sparse inputs u. The set of conditions needed for this 
sake has been refined in the last years (cf., e.g. [28, 30, 37, 47, 95]) and extended 
also to exact reconstruction in the case of the unconstrained problem (4.21) and to 
related problems such as the reconstruction of low-rank matrices ([29]) or matrices 
with certain sparsity patterns in rows or lines (cf., e.g. [31, 108]) by convex variational 
methods. 

From the inverse problems point of view, a potential issue in applying the com- 
pressed sensing theory is the fact that it is formulated in a strictly finite-dimensional 
setting, while it is more natural to study the limit of infinite dimensions in the inver- 
sion. In particular, it would be desirable to have a theory that works at different image 
resolutions in this context, but it is neither well studied how sparsity priors apply at 
different resolutions nor how the conditions for exact recovery change when refin- 
ing the image resolution. The first issue has recently been studied ([1, 2]) and led to 
several interesting results. The second issue is more severe in the case of ill-posed in- 
verse problems. Due to the ill-posedness, it is clear that the coherence between some 
rows needs to converge to one. Also, the restricted isometry property will be difficult
to satisfy for reasonable values of the sparsity level k; even for k = 1, one can
construct simple counterexamples for compact operators based on the singular value
decomposition. Indeed, it has been verified computationally that even reasonable
discretizations of simple forward operators like the Radon or X-ray transform are far 
from satisfying restricted isometry properties ([40]). Hence, it seems that in the study 
of superresolution in an inverse problems setting, it is more reasonable to pose the 
question in a different way and to rather understand for the given operator which so- 
lutions are reconstructed nicely or even favored. In the sparsity setting, this would 
mean to ask on which k-dimensional subspaces conditions for exact reconstruction 
are satisfied. A condition that can be generalized to infinite dimensions is the exact 
recovery condition [47, 95] which has been used for inverse problems in [94] and [10]. 

For more general convex regularizations R like total variation, it is more difficult
to analyze the structure of solutions. To this end, classical concepts in regularization
theory, like the source condition


$$ \exists q : \; K^* S^* q \in \partial R(u) \tag{4.35} $$

or, even more, the stronger source condition

$$ \exists w : \; K^* S^* S K w \in \partial R(u), \tag{4.36} $$


are useful. Such conditions were mainly used to obtain error estimates for variational 
regularizations. However, in the case of singular regularization functionals like $\ell^1$ or
total variation, which have large subdifferentials $\partial R(u)$, it turns out that error estimates
like in [81] can indeed imply exact reconstruction. We refer to [10, 68] for further
discussion and for establishing relations between source conditions and other conditions
used in compressed sensing. Another recently proposed concept that allows one
to study the exact solution is a generalization of singular vectors and singular values
to the case of nonlinear regularization defined by


$$ \lambda K^* S^* S K u_\lambda \in \partial R(u_\lambda). \tag{4.37} $$


The value $\lambda$ and the corresponding singular vector $u_\lambda$ can also be used to define
and analyze properties at different scales in an abstract way.


4.3 Inpainting 


Inpainting is a classical multidimensional interpolation problem, i.e. the restoration
of an image in a subregion $\Sigma \subset \Omega$, which is generally called the inpainting domain.
In the predigital era, inpainting was already carried out by restorers in arts, who used 
their conception of the overall image and the original painter’s approach to inpaint 
damaged regions in paintings. This history is also the reason for the nomenclature 
inpainting instead of interpolation. Another key difference to classical interpolation 
theory and methods usually taught in numerical analysis is the kind of prior knowl- 
edge and of the desired results. While classical interpolation methods work well for 
smooth functions and small gaps to be interpolated (which is reflected by standard er- 
ror estimates), inpainting of images again needs to preserve (or continue) edges and 
textures, i.e. the nonsmooth components. 

In a simple inverse problems formulation, we can formulate inpainting as an op- 
erator equation (4.6) with 


$$ K : \mathcal{U}(\Omega) \to \mathcal{U}(\Omega \setminus \Sigma), \qquad u \mapsto u|_{\Omega \setminus \Sigma}, \tag{4.38} $$


where $\mathcal{U}(\Omega)$ is an appropriate function space, e.g. $BV(\Omega)$ in the case of cartoon
images. Note that K can also be written as


$$ (Ku)(x) = \chi_{\Omega \setminus \Sigma}(x)\, u(x), \tag{4.39} $$


where $\chi_D$ is the indicator function of a set D (equal to one inside and zero outside).
This immediately implies that K has a large null space (all functions supported in $\Sigma$),
but behaves well (like the identity) orthogonal to the null space.
Well-known models for image inpainting consist of minimizing a variational
functional

$$ J(u) = D(u, f) + \alpha R(u) \to \min_u \tag{4.40} $$


with a distance term D(u, f) that first projects u to the smaller support $\Omega \setminus \Sigma$ and
then compares it with the given data f. The regularization term R(u) specifies the
method of inpainting on $\Sigma$ (see Figure 4.7 for an illustration).

• Laplace inpainting: We set the data term D(u, f) as the squared difference from f
and choose the regularization as the squared gradient norm to assure smooth areas
inside the inpainting domain and obtain


$$ \frac{1}{2} \int_{\Omega \setminus \Sigma} (Ku - f)^2 + \frac{\alpha}{2} \int_\Omega |\nabla u|^2 \to \min_u, \tag{4.41} $$


where K is the so-called downscaling operator. Calculating optimality conditions results
in

$$ K^* (Ku - f)|_{\Omega \setminus \Sigma} - \alpha \Delta u = 0 \tag{4.42} $$


and we see that for solving the variational problem, we have to solve the Laplace
equation on the inpainting domain $\Sigma$ with boundary conditions that come from the
known image.

• TV inpainting: Again, we choose a quadratic penalty for the known data. However,
for the regularizer, we would like to minimize the total variation of u, and hence


$$ \frac{1}{2} \int_{\Omega \setminus \Sigma} (Ku - f)^2 + \alpha \int_\Omega |\nabla u| \to \min_u. \tag{4.43} $$


This regularizer will fill the inpainting domain with piecewise constant areas. The
optimality conditions are given by


$$ K^* (Ku - f)|_{\Omega \setminus \Sigma} - \alpha \nabla \cdot \left( \frac{\nabla u}{|\nabla u|} \right) = 0. \tag{4.44} $$

For the implementation by a steepest descent algorithm, we obtain a nonlinear
diffusion-reaction system


$$ \frac{\partial u}{\partial t} = -K^* (Ku - f)|_{\Omega \setminus \Sigma} + \alpha \nabla \cdot \left( \frac{\nabla u}{|\nabla u|} \right). \tag{4.45} $$


To avoid a singularity of $1/|\nabla u|$, the norm is usually replaced by
$|\nabla u|_\epsilon = \sqrt{\epsilon^2 + |\nabla u|^2}$ with $\epsilon$ being a small positive constant.

• TV-$H^{-1}$ inpainting: Both Laplace and TV inpainting belong to the class of
second-order inpainting methods, where the order is given by the highest derivative
in the corresponding Euler-Lagrange optimality condition. Second-order methods
generally have two important drawbacks. First, they are not able to connect edges
over large distances and secondly, a continuous curvature is not propagated from
the image into the inpainting domain. Methods of higher order are able to fix these
drawbacks. A method of particular interest is called TV-$H^{-1}$ inpainting. The image
is inpainted via


$$ \frac{\partial u}{\partial t} = -K^* (Ku - f)|_{\Omega \setminus \Sigma} + \alpha \Delta p, \qquad p \in \partial TV(u), \tag{4.46} $$

Figure 4.7: Results of different inpainting methods. First row: Original image (left) and damaged 
image with 50% missing pixels (right). Second row: Restoration with Laplace-inpainting (left) and 
restoration with TV-inpainting (right). 


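where $\partial TV(u)$ denotes the subdifferential of TV(u). The element p is approximated
by $-\nabla \cdot [\nabla u / |\nabla u|_\epsilon]$, where $|\nabla u|_\epsilon$ is again a smoothed version of $|\nabla u|$ (see above).
For the implementation of the resulting PDE


$$ \frac{\partial u}{\partial t} = -K^* (Ku - f)|_{\Omega \setminus \Sigma} - \alpha \Delta \nabla \cdot \left( \frac{\nabla u}{|\nabla u|_\epsilon} \right), \tag{4.47} $$


we refer to ([86]). 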
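
The simplest of the three methods above, Laplace inpainting, can be sketched in a few
lines. The following Python/NumPy code is a minimal illustration under simplifying
assumptions that are not part of the original text: the known pixels are kept fixed
(which corresponds to the limit of small alpha for noise-free data), the discrete
Laplace equation on the inpainting domain is solved by plain Jacobi iterations, and
pixels outside the image are treated by replicating the boundary values. The function
name and parameters are illustrative.

```python
import numpy as np


def laplace_inpaint(f, missing, n_iter=500):
    """Fill the pixels marked in `missing` by solving the discrete Laplace equation.

    f: 2D image; its values inside the inpainting domain are ignored.
    missing: boolean mask of the inpainting domain (True = unknown pixel).
    Known pixels act as Dirichlet boundary data for the unknown ones.
    """
    u = np.array(f, dtype=float)
    u[missing] = u[~missing].mean()                  # initial guess in the hole
    for _ in range(n_iter):
        padded = np.pad(u, 1, mode="edge")           # replicate values at the image border
        neighbor_avg = 0.25 * (padded[:-2, 1:-1] + padded[2:, 1:-1]
                               + padded[1:-1, :-2] + padded[1:-1, 2:])
        u[missing] = neighbor_avg[missing]           # Jacobi step, only inside the hole
    return u
```

A linear ramp is discretely harmonic, so a hole in such an image is restored exactly,
whereas an edge crossing the hole would be smeared out. This is the smoothing behavior
that motivates the TV-based alternatives. For large inpainting domains, Jacobi iterations
converge slowly and a direct sparse solver or a multigrid method would be preferable.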


An even more recent problem is the inpainting of videos. It poses further challenges
for computation and modeling, but also offers more prior information. An example
is the inpainting of damaged parts in single frames of a video, where clearly the
previous and subsequent frames can be used to gain information. We refer to [20, 36]
for further information.


4.4 Surface imaging 


As already mentioned in the case of tomography, it is only possible to acquire mea- 
surements related to images on outside surfaces in most cases, particularly in medical 
imaging devices. This means the data are effectively taken on a surface, while the un- 
known image is a function in the inside volume. In tomography, the dimensionality 
of the data is matched with the one of the image by the additional rotational degree 
of freedom, but in some modalities, this is not the case. Limited-angle tomography,
e.g. with C-arm devices or in electron tomography, is already a borderline case, being
severely ill-posed, that is, underdetermined. Extreme cases are optical tomography
(fluorescence or bioluminescence) or electromagnetic imaging (EEG/MEG or
ECG/MCG). In the optical case, data can be acquired only for a few different frequencies
and a few different angles, if at all. In the case of electrical and magnetic data,
one has no further options to obtain data. Additional prior information can be due 
to physiological considerations on the one hand, e.g. sparsity of sources in space in 
some optical investigations or in EEG/MEG, and anatomical prior information from 
X-ray, CT, or MR information on the other hand. The latter can particularly restrict the 
support of the unknown image to the relevant structures. 

As a simplified example that reflects the mathematical issues in such problems well, 
let us consider a source reconstruction problem for the Poisson equation, i.e. 

\[ -\Delta v = u \quad \text{in } \Omega \subset \mathbb{R}^d, \tag{4.48} \]

with homogeneous Neumann boundary conditions $\partial v / \partial n = 0$ on $\partial\Omega$. The forward 
operator $K : L^2_\diamond(\Omega) \to L^2(\partial\Omega)$ is given by the map $u \mapsto v|_{\partial\Omega}$, where v is the unique 
solution of (4.48) with 

\[ \int_{\partial\Omega} v \, d\sigma = 0. \tag{4.49} \]

By $L^2_\diamond(\Omega)$, we denote the subspace of $L^2(\Omega)$ consisting of those functions with 

\[ \int_\Omega u \, dx = 0. \tag{4.50} \]

Obviously, the forward operator K has a huge null space, including, in particular, 
the Laplacian of any compactly supported smooth function. In order to understand 
how the null space is affected by a variational regularization used to incorporate prior 
information, we again consider the problem 

\[ R(u) \to \min \quad \text{subject to } Ku = f. \tag{4.51} \]
This problem can be formulated as a constrained minimization problem with the 
Lagrangian 

\[ L(u,v;p,q) = R(u) + \int_{\partial\Omega} (v - f)\, p \, d\sigma + \int_\Omega (\nabla v \cdot \nabla q - q\, u) \, dx. \tag{4.52} \]


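If Lagrange multipliers p and q exist, which corresponds to the so-called source 
condition in regularization theory ([21, 42]), then they satisfy 

\[ q \in \partial R(u), \tag{4.53} \]
\[ -\Delta q = 0 \quad \text{in } \Omega, \tag{4.54} \]
\[ \frac{\partial q}{\partial n} + p = 0 \quad \text{on } \partial\Omega. \tag{4.55} \]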
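
The origin of these conditions can be sketched formally as follows. The sketch assumes that the potential v is treated as an independent state variable, that the Poisson equation is imposed in weak form with multiplier q, and that the data constraint is imposed with multiplier p; the test function $\varphi$ and the sign conventions are choices made for this illustration only.

```latex
\begin{aligned}
L(u,v;p,q) &= R(u) + \int_{\partial\Omega} (v-f)\,p\,d\sigma
              + \int_\Omega (\nabla v\cdot\nabla q - q\,u)\,dx, \\
0 \in \partial_u L &= \partial R(u) - q
   \quad\Longrightarrow\quad q \in \partial R(u), \\
0 = \partial_v L[\varphi] &= \int_{\partial\Omega} p\,\varphi\,d\sigma
              + \int_\Omega \nabla\varphi\cdot\nabla q\,dx \\
  &= \int_\Omega (-\Delta q)\,\varphi\,dx
     + \int_{\partial\Omega} \Big(\frac{\partial q}{\partial n} + p\Big)\,\varphi\,d\sigma
   \qquad \text{for all test functions } \varphi .
\end{aligned}
```

Since $\varphi$ is arbitrary in the interior and on the boundary, q has to be harmonic in $\Omega$, while the boundary condition only fixes p in terms of the normal derivative of q. The essential requirement is therefore that $\partial R(u)$ contains a harmonic function.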


Now, let us consider some regularization functionals R and their impact. We shall 
denote by w a weight representing anatomical prior knowledge, i.e. w(x) is large 
if x is likely to be an element of the support of u and w(x) = 0 otherwise. We consider 
the following cases: 

• Minimum-Norm Solutions: This case, usually used if no specific prior information 
is available, corresponds to 

\[ R(u) = \frac{1}{2} \int_\Omega u^2 \, dx. \tag{4.56} \]

One easily checks $\partial R(u) = \{u\}$, and thus (4.53)-(4.55) are satisfied if u is a harmonic 
function in $\Omega$. Due to elliptic regularity, the reconstruction will be smooth inside $\Omega$ 
and cannot have compact support. Moreover, note that by the maximum principle 
for harmonic functions, u attains its maximum on $\partial\Omega$. The latter explains the 
so-called depth bias frequently observed in such problems ([26, 61, 70]), 
i.e. the minimum-norm solution shifts the mass of the reconstruction towards the 
surface. 

• Weighted-Norm Solutions: Including the prior by the weight w in the $L^2$-norm, we 
have 

\[ R(u) = \frac{1}{2} \int_\Omega \frac{u^2}{w} \, dx. \tag{4.57} \]

One easily checks $\partial R(u) = \{u/w\}$, and thus (4.53)-(4.55) are satisfied if u = wq for 
a harmonic function q in $\Omega$. Depending on w, the weighting can clearly reduce the depth 
bias. Inside regions of homogeneous weights w, the reconstruction u is still harmonic, 
and thus the maximum principle holds for u. Hence, the mass is still shifted 
towards the outside surface as much as possible. 

• Sparse Solutions: In order to obtain solutions with very small support, it is natural 
to use $L^1$-type priors, i.e. 

\[ R(u) = \int_\Omega |u| \, dx. \tag{4.58} \]

Formally, the subdifferential is given by $\partial R(u) = \{s\}$, where 

\[ s(x) \begin{cases} = 1 & \text{if } u(x) > 0, \\ = -1 & \text{if } u(x) < 0, \\ \in [-1,1] & \text{if } u(x) = 0, \end{cases} \tag{4.59} \]

is the multivalued sign. 

Condition (4.53)-(4.55) is satisfied if s is a harmonic function in $\Omega$. Again, by the 
strong maximum principle, s is either constant or attains its maximum and minimum 
only at the boundary. In the first case, we need to further distinguish three cases: If 
the absolute value of the constant is less than one, then u vanishes everywhere. If the 
constant equals one, then u is nonnegative everywhere and needs to be equal to zero 
again due to the vanishing mean value. If the constant equals minus one, an analogous 
argument holds. Clearly, the absolute value of s cannot be larger than one, and 
thus in any case, u = 0 if s is constant. If s is not constant, it attains its maximum 
and minimum on $\partial\Omega$, and thus the absolute value of s is necessarily less than one 
in the interior of $\Omega$, which again means that u vanishes there. The consequence is that u 
needs to be concentrated on $\partial\Omega$, which of course does not work within an $L^1$-theory, 
but can be made rigorous in the usual way by considering Radon measures and their 
total variation instead of $L^1$-functions and their norm. This way, we observe the 
extreme consequences of the depth bias on sparsity, that is, the solution will always be 
concentrated at zero depth. 

• Weighted Sparse Solutions: Again, $L^1$-type priors can be weighted using anatomical 
prior information, i.e. 

\[ R(u) = \int_\Omega \frac{|u|}{w} \, dx. \tag{4.60} \]

Formally, the subdifferential is given by $\partial R(u) = \{s/w\}$, where s is a multivalued sign 
of u. Now, (4.53)-(4.55) is satisfied if s = wq for a harmonic function q in $\Omega$. For 
appropriate weights, s can achieve its minimum and maximum inside the domain $\Omega$ 
such that the support is not necessarily on the outer surface. However, the possible 
maximum still strongly depends on the properties of w. For homogeneous regions, 
a maximum will again be on the part closer to the outer surface, and thus a depth 
bias prevails. Some depth bias can be eliminated if w is scaled appropriately with 
respect to the operator. 

• Total Variation Regularization: If we use total variation regularization, formally 

\[ R(u) = \int_\Omega |\nabla u| \, dx, \tag{4.61} \]

then the subdifferential contains elements of the form $\nabla \cdot g$, with $g = \nabla u / |\nabla u|$ 
(up to sign), on smooth parts with nonvanishing gradients. Basic arguments in differential 
geometry imply that $\nabla \cdot g$ is indeed the mean curvature of the level sets of u. Now, since 
$\nabla \cdot g$ is harmonic, we need to expect that the maximal curvature is attained on $\partial\Omega$. 
Hence, we cannot expect to reconstruct small features (with large curvature) correctly 
with increasing distance from the measurement surface. 


From the above discussion, one observes that all standard approaches to incor- 
porate prior knowledge suffer from severe shortcomings in the application to imaging 
from (underdetermined) surface data. It remains an important future challenge to de- 
velop improved approaches that can provably reconstruct structures corresponding 
to the available prior knowledge. 


5 Calibration problems 


In recent decades, significant technical improvements have been made to existing 
devices and new modalities have been invented, such that resolution is continuously 
improving. Novel devices to increase spatial resolution and fast measurements to 
increase time resolution lead, however, to a novel kind of mathematical problem, 
which we summarize under the term calibration problems in the following. The major 
issues are that a good characterization of the device properties (e.g. the PSF of a 
microscope) is not (yet) possible or depends on the subject to be imaged, or that the 
need to take fast measurements does not leave enough time to calibrate the device 
well (e.g. coils in fast MR imaging). 
The resulting mathematical structure is typically of the form 


K(p)u =f, (4.62) 


now with K(p) a linear operator depending (possibly in a nonlinear way) on the pa- 
rameter (functions) encoded by p. Even if the dependence on p is linear, the overall 
inverse problem becomes nonlinear, often a bilinear problem, which is clearly more 
difficult to solve than the linear inversion for given p. Moreover, even if the problem 
for given p can be overdetermined, the joint reconstruction of u and p may be under- 
determined and thus again enforces the use of appropriate prior information. Clearly, 
the prior information on the parameters is quite different than the one on the image. 
Usually, good mean values are available for p as well as a strong perception of spa- 
tial smoothness, which means that such functions are usually modeled as elements 
in Sobolev spaces (often of high order), with a small distance to the given prior value. 
Moreover, other structural constraints such as nonnegativity can be available. 


5.1 Blind deconvolution 


A classical problem of the above type is blind deconvolution, in its original version 
with p being the point spread function itself, i.e. the bilinear forward operator 

\[ [K(p)u](x) = \int_\Omega p(x-y)\, u(y) \, dy, \tag{4.63} \]

where u is a single image or even a vector of images. Due to the obvious 
underdetermination of the nonlinear inverse problem, it is essential to use appropriate prior 
information on the kernel p and the image u. An obvious constraint is nonnegativity 
of p together with a scaling property such as $\int_\Omega p(x) \, dx = 1$. Further knowledge is 
usually introduced via appropriate regularization, particularly by minimizing functionals 
like 

\[ D(K(p)u, f) + \alpha_1 R_1(u) + \alpha_2 R_2(p), \tag{4.64} \]

where D is a standard distance functional such as the squared $L^2$-norm or the 
Kullback-Leibler divergence. The functionals $R_i$ are different regularization terms 
with regularization parameters $\alpha_i$. 
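
One simple way to attack a functional like (4.64) numerically is to minimize alternately over u and p. The following is a minimal sketch under strong simplifying assumptions that are not taken from the text: a periodic one-dimensional signal, D chosen as the squared $L^2$-distance, and the quadratic regularizers $R_1(u) = \frac{1}{2}\|u\|^2$ and $R_2(p) = \frac{1}{2}\|p - p_0\|^2$ with a hypothetical prior kernel `p0`. Nonnegativity and the normalization of p are ignored. With these choices both subproblems decouple in the Fourier domain and have closed-form solutions.

```python
import numpy as np

def objective(u_hat, p_hat, f_hat, p0_hat, a1, a2):
    """Value of 0.5*||p*u - f||^2 + 0.5*a1*||u||^2 + 0.5*a2*||p - p0||^2.

    All arguments are unnormalized DFTs; Parseval gives the factor 1/n."""
    n = f_hat.size
    data = 0.5 * np.sum(np.abs(p_hat * u_hat - f_hat) ** 2)
    reg_u = 0.5 * a1 * np.sum(np.abs(u_hat) ** 2)
    reg_p = 0.5 * a2 * np.sum(np.abs(p_hat - p0_hat) ** 2)
    return (data + reg_u + reg_p) / n

def blind_deconvolve(f, p0, a1=1e-2, a2=1.0, iters=20):
    """Alternating exact minimization over the image u and the kernel p."""
    f_hat, p0_hat = np.fft.fft(f), np.fft.fft(p0)
    p_hat = p0_hat.copy()
    history = []
    for _ in range(iters):
        # u-step: Tikhonov-regularized deconvolution for the current kernel
        u_hat = np.conj(p_hat) * f_hat / (np.abs(p_hat) ** 2 + a1)
        # p-step: same structure, but pulled towards the prior kernel p0
        p_hat = (np.conj(u_hat) * f_hat + a2 * p0_hat) / (np.abs(u_hat) ** 2 + a2)
        history.append(objective(u_hat, p_hat, f_hat, p0_hat, a1, a2))
    return np.fft.ifft(u_hat).real, np.fft.ifft(p_hat).real, history

# Synthetic test problem (all values are illustrative assumptions)
n = 128
x = np.arange(n)
dist = np.minimum(x, n - x)  # periodic distance to index 0

def gaussian_kernel(sigma):
    k = np.exp(-0.5 * (dist / sigma) ** 2)
    return k / k.sum()

u_true = np.zeros(n)
u_true[30:50] = 1.0
u_true[80:90] = 0.5
p_true = gaussian_kernel(2.0)   # kernel that generated the data
p0 = gaussian_kernel(3.0)       # miscalibrated prior guess
rng = np.random.default_rng(0)
blurred = np.fft.ifft(np.fft.fft(p_true) * np.fft.fft(u_true)).real
f = blurred + 0.01 * rng.standard_normal(n)

u, p, history = blind_deconvolve(f, p0)
```

Because each step minimizes the functional exactly in one variable, the values in `history` cannot increase. This says nothing about reaching a global minimizer of the nonconvex joint problem, which is exactly where the prior information on p becomes important.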

In many instances, the blind deconvolution problem can be modified with additional 
modeling of the point spread function. Prominent examples are the phase 
effects appearing in various optical imaging modalities from astronomy down to 
nanoscopy. For the phase being the parameter to be determined, we have 

\[ [K(p)u](x) = \int_\Omega k(x-y, p(y))\, u(y) \, dy, \tag{4.65} \]

with a given form of the kernel k, usually 

\[ k(x,p) = k_0(x)\, \big| E_1(x) - e^{ip} E_2(x) \big|^2, \tag{4.66} \]

where $E_1$ and $E_2$ are the counterpropagating fields. Using variational approaches 
like (4.64), even with quadratic priors for the image u and (in higher Sobolev spaces) 
the phase p, a sufficiently good estimate of the phase can be found. It has been 
demonstrated that this phase can be used to obtain reconstructions of superior quality 
by advanced total variation reconstruction methods ([90]). Such a two-step approach 
is indeed tempting for many calibration problems, since one can benefit 
from the fact that the forward operator is not too sensitive with respect to the 
parameter p. Hence, even a rough estimate of p is sufficient for strong improvements in 
the estimation of u, which, in the end, is the quantity of interest. A thorough 
mathematical analysis of such a two-step procedure is still missing. 


5.2 Nonlinear MR imaging 


A class of calibration problems that has gained high interest recently is that arising 
from fast measurements in MRI. In the standard setting, one obtains MR data from the 
model 

\[ f(t) = \int_\Omega u(x)\, s(x)\, e^{-i k(t) \cdot x} \, dx, \tag{4.67} \]

where s is the known (precalibrated) coil profile and k(t) encodes the specific 
trajectory used for the MR measurements. If one tries to obtain fast measurements, there 
is often not enough time to calibrate the coils accurately, and hence p = s is to be 
treated as an unknown. Thus, one ends up with a bilinear inverse problem, however 
with a good prior $s_0(x)$ from the precalibration. Deviations of s from $s_0$ can be 
expected to be small and smooth, and hence one can use a smoothness prior with a large 
regularization parameter on $s - s_0$. 

In some cases, further effects such as relaxation or field inhomogeneities become 
relevant; a more appropriate forward model is then given by 

\[ f(t) = \int_\Omega u(x)\, s(x)\, e^{-t/R^*(x)}\, e^{-i \omega(x) t}\, e^{-i k(t) \cdot x} \, dx, \tag{4.68} \]

where $R^*$ is a relaxation time and $\omega$ models the local field inhomogeneity. Potential 
candidates for the parameter p are the coil sensitivities ([96]), the field 
inhomogeneity ([92]), and the relaxation time ([76]). So far, there exist only a few, 
rather practical, approaches to the solution of these nonlinear underdetermined inverse 
problems. A detailed analysis highlighting the potential and limitations of the joint 
reconstruction is an important future task. 


5.3 Attenuation correction in SPECT 


As mentioned in Section 2.3, SPECT imaging has a challenging structure with respect 
to attenuation. The forward operator is of the form 

\[ (Ku)(z,\theta) = \int_{L(z,\theta)} e^{-\int_{L_x(z,\theta)} \rho(y) \, dy}\, u(x) \, dx, \tag{4.69} \]

where by $L(z,\theta)$ we denote the line starting at z in direction $\theta$ and by $L_x(z,\theta)$ 
the line segment between z and x. Note that the tracer density u is different from the 
(scaled) physical density $\rho$, and thus the latter is usually determined by an X-ray scan 
before the SPECT measurement. 

In several instances, e.g. in the case of patient movement or dynamic imaging of 
moving objects, the attenuation determined initially does not remain valid. Thus, it 
becomes necessary to reconstruct the attenuation density $\rho$ together with u, i.e. the 
inverse problem becomes nonlinear and of the form (4.62) with $p = \rho$. The additional 
degrees of freedom do not necessarily lead to an underdetermined problem, since the 
original SPECT data set from all directions indeed overdetermines u. However, the 
solution of the problem remains challenging from a theoretical perspective ([6, 84]) 
as well as from a computational point of view. For the latter, various iterative 
techniques have been investigated ([79]), the most natural being of course an alternating 
minimization of a functional like (4.64). 

A different approach is to partly use the previously measured function $\rho$ and 
instead determine the deformation that appeared before the PET scan. Thus, the 
parameter p is a vector-valued quantity, with a strong prior of p being close to the 
identity. The forward model thus becomes 

\[ [K(p)u](z,\theta) = \int_{L(z,\theta)} e^{-\int_{L_x(z,\theta)} \rho(p(y)) \, dy}\, u(x) \, dx, \tag{4.70} \]

with $\rho$ given. Already with simple parameterizations, one can obtain significant 
improvements ([102]). With advanced techniques of nonlinear image registration ([67]), 
further steps have been made recently ([8]). 


5.4 Blind spectral unmixing 


With the recent advances in multi- and hyperspectral imaging, the unmixing of spec- 
tral signals into basic components has received increasing attention. Blind spectral 
unmixing is a classical problem in audio applications. Some striking examples are 
the decomposition of party talk into single person statements and the decomposition 
of an orchestra recording into the different instruments. In the imaging context, one 
usually seeks a decomposition of the spectrum into spectra of basic materials to ob- 
tain a good characterization of the content of a certain region. 

In discrete modeling, the spectral image is a matrix $F \in \mathbb{R}^{N \times M}$, where N is the 
number of pixels (or voxels) and M is the number of spectral points. Spectral 
unmixing looks for a coefficient matrix $U \in \mathbb{R}^{N \times K}$, where $U_{ij}$ is the coefficient with 
respect to the j-th spectral basis function in pixel i. By collecting the basis spectra in 
a matrix $B \in \mathbb{R}^{K \times M}$, one thus has to solve the matrix equation 


UB =F. (4.71) 


While B is given in the classical unmixing problem, it is an unknown itself in the 
blind unmixing or blind separation problem. Since the data F as well as U and B 
have naturally nonnegative elements, solving (4.71) can also be cast in the framework 
of nonnegative matrix factorization. In the above framework, we have u = U, p = B 
and K(p) being the multiplication operator. One also observes the relation to blind 
deconvolution problems, whose discrete version is a special form of blind unmixing 
with B restricted to the class of Toeplitz matrices. 
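
To illustrate the connection to nonnegative matrix factorization, the following minimal sketch factorizes a synthetic spectral image F into nonnegative factors U and B. It uses the classical multiplicative update rules for the squared Frobenius distance, which is one standard choice and not necessarily the method used in the cited literature. The function name `nmf_unmix`, the random initialization, and all sizes are assumptions made for this example only.

```python
import numpy as np

def nmf_unmix(F, K, iters=200, eps=1e-12, seed=0):
    """Approximate F (N x M, nonnegative) by U @ B with U (N x K), B (K x M) >= 0.

    Multiplicative updates for the squared Frobenius distance; eps only guards
    against division by zero."""
    rng = np.random.default_rng(seed)
    N, M = F.shape
    U = rng.random((N, K)) + 0.1
    B = rng.random((K, M)) + 0.1
    errors = []
    for _ in range(iters):
        # Update of the coefficients for fixed basis spectra
        U *= (F @ B.T) / (U @ (B @ B.T) + eps)
        # Update of the basis spectra for fixed coefficients
        B *= (U.T @ F) / ((U.T @ U) @ B + eps)
        errors.append(np.linalg.norm(U @ B - F))
    return U, B, errors

# Synthetic data: 50 pixels, 20 spectral points, 3 basis spectra (illustrative)
rng = np.random.default_rng(1)
U_true = rng.random((50, 3))
B_true = rng.random((3, 20))
F = U_true @ B_true

U, B, errors = nmf_unmix(F, K=3)
```

The factors stay nonnegative because every update multiplies by a nonnegative ratio. Note that even a perfect fit does not recover `U_true` and `B_true` uniquely: any rescaling or permutation of the basis spectra gives the same product, which is one aspect of the nonuniqueness that additional priors are meant to reduce.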
A particular property in hyperspectral imaging is a spatial correlation between 
the pixels, which can be modeled in the regularization in order to decrease the 
nonuniqueness in unmixing. With the popular TV prior, this naturally leads to 

\[ \|UB - F\|^2 + \alpha \sum_j TV(U_{\cdot j}) + \beta \sum_i R(U_{i \cdot}) + \gamma S(B), \tag{4.72} \]

where $U_{\cdot j}$ denotes the j-th column and $U_{i \cdot}$ the i-th row of U. Moreover, TV denotes 
the total variation of a discrete image, the functional R is a local prior in each pixel, 
e.g. the $\ell^1$-norm to enforce sparsity with respect to the basis, and S is a functional 
that models specific prior knowledge on the basis elements, e.g. an $\ell^1$-type norm to 
enforce sparsity in a certain basis. 

So far, most of the analysis of blind unmixing is carried out in finite dimension, 
thus rather for ill-conditioned than ill-posed inverse problems. However, with increas- 
ing spatial and spectral resolution of imaging devices, it becomes interesting to study 
the asymptotics of unmixing problems as N and K tend to infinity (independently or 
in appropriate relative scaling). Useful reconstruction approaches and algorithms cer- 
tainly should be characterized by a robust behavior with respect to the asymptotics. 
Such modeling of the asymptotics and different spatial resolution is also relevant if 
hybrid imaging is used. In several cases, the hyperspectral data are acquired with low 
spatial resolution at the same time as a conventional color image at high spatial 
resolution. The superresolution in the hyperspectral image based on the correlation with 
the color image is a challenging inverse problem; we refer to [69] for further details. 

In addition to the pure unmixing, an interesting inverse problem is to study joint 
image reconstruction and unmixing. With a forward operator acting on the pixel di- 
mension, the problem becomes 

AUB =F, (4.73) 


with a given matrix A. 


6 Model-based dynamic imaging 


An ultimate goal in a variety of modern imaging approaches is to obtain (quantita- 
tive) information about dynamics instead of only still images. Roughly speaking, this 
means that instead of a single image u, a whole sequence u(t) for varying time t 
needs to be reconstructed and its dynamics needs to be analyzed. The inverse prob- 
lem of reconstructing dynamic images can usually be formulated as 


\[ Ku(t) = f(t), \qquad t \in [0,T], \tag{4.74} \]

since the forward operator is hardly changing with the dynamics. In several instances, 
the time resolution is so low that it seems more appropriate to consider a time-discrete 
model 

\[ Ku(t_i) = f(t_i), \qquad i = 1, \ldots, M. \tag{4.75} \]

It is obvious from the fact that K does not depend on time that the image 
reconstruction problem can be split into several stationary reconstruction steps at 
different times, making all standard methods applicable. However, important information 
is lost this way. In general, the images u(t) and consequently also the data f(t) 
are strongly correlated in time, since they are typically generated by a smooth time 
evolution rather than arbitrary changes. One way to incorporate this kind of prior 
information is to use regularization functionals that penalize large changes in time. 
Frequently used examples (mainly due to their simplicity) are 

\[ R(u) = \int_0^T R_0(u(t)) \, dt + \frac{1}{2} \int_0^T \|\partial_t u(t)\|^2 \, dt, \tag{4.76} \]

where $R_0$ is a regularization functional in space, respectively 

\[ R(u) = \sum_{i=1}^{M} R_0(u(t_i)) + \frac{1}{2} \sum_{i=1}^{M-1} \|u(t_{i+1}) - u(t_i)\|^2. \tag{4.77} \]


Besides such all-purpose approaches, a different paradigm taking into account the 
mathematical modeling of the underlying dynamics has evolved. The correlation is 
guaranteed by using ODE (ordinary differential equation) or PDE (partial differential 
equation) models appropriately describing the dynamics, usually with unknown 
parameter functions to be reconstructed. The image sequence is obtained implicitly by 
solving the forward model with the reconstructed parameters. Since either good priors for 
those parameter functions exist or they are of lower dimensionality (e.g. independent 
of time), thus making the inversion overdetermined, improved reconstructions can be 
gained from such approaches. The main bottlenecks are the mathematical difficulty 
and the computational challenges compared to separate reconstructions at different time 
steps. Instead of reconstructing a series of images from a linear stationary forward 
problem, one now has to identify parameters in nonlinear time-dependent differential 
equations, which, as a further complication compared to well-known parameter 
identification problems, have to be combined with the forward operator of the imaging system. 
For these reasons, the majority of such approaches, with some exceptions for reasonably 
simple forward problems, are still at the level of basic mathematical research, 
but they have high potential to lead to practical advances. 


6.1 Kinetic models 


Kinetic models are used to model biochemical effects or as coarse descriptions 
of the diffusion and exchange of blood traced in examinations with emission 
tomography ([103]). The majority of such models use first-order kinetics, i.e. the dynamics 
of the image is given by 

\[ u(x,t) = \sum_{j=1}^{k} w_j(x,P(x))\, u_j(x,t) + w_0(x,P(x))\, I(t), \tag{4.78} \]

where $w_0$ and $w_j$ are weights modeling the fraction of the components in different 
subregions. Here, the vector $U = (u_j)$ of different states follows an ODE system of 
the form 

\[ \partial_t U(x,t) = A(x,P(x))\, U(x,t) + B(x,P(x))\, I(t), \tag{4.79} \]

where P is the vector of unknown parameters, A and B are matrices in $\mathbb{R}^{k \times k}$ 
depending linearly on P, and I is a vector of input functions, which we assume to be given 
here (in practice, they are sometimes estimated from data first, which is a different 
issue). 

A simple example is the one-compartment model for perfusion used in positron 
emission tomography (PET) with radioactive water ($H_2{}^{15}O$), that is, a tracer which 
follows the blood flow. The tracer activity in the heart, which is the image in PET, can 
be written as 

\[ u(x,t) = w_0(x)\, I_0(t) + w_1(x)\, u_1(x,t), \tag{4.80} \]

with the concentration of the tracer in tissue $u_1$ and the (homogeneous) arterial 
concentration $I_0$. The weights $w_0$ and $w_1$ correspond to the respective volume fractions 
of arteries and tissue and can be written as 

\[ w_0(x) = \chi(x)\, p_3(x), \qquad w_1(x) = \chi(x)\, (1 - p_3(x)), \tag{4.81} \]

where $\chi$ is an indicator function of the heart, namely, the region containing blood, 
which we again consider as given. The ODE system describing the dynamics is given 
by 

\[ \partial_t u_1(x,t) = -p_1(x)\, u_1(x,t) + p_2(x)\, I_0(t). \tag{4.82} \]

The model-based inversion now looks for the parameter vector $P = (p_1, p_2, p_3)$ 
related to the perfusion of tissue (in ml blood per second per ml tissue) and the tissue 
fraction using the above model equations. In this case, the parameters themselves 
are more interesting than the image sequence anyway. The current state of the art is 
to first reconstruct the image sequence u(t) and then extract parameters in regions of 
interest in order to obtain a quantitative analysis of perfusion. Due to the inherently 
high noise in time-resolved PET, the reconstructed images are of rather low quality, 
which limits the success of the subsequent parameter estimation, in particular its 
spatial resolution. By directly inverting for the parameters ([12]), one obtains significantly 
fewer degrees of freedom than by inverting for the image sequence, which allows one 
to increase the spatial resolution. Let us also mention that the above time-continuous 
modeling seems appropriate for data acquisition in a list-mode format, i.e. for all 
decay events, the exact time is recorded and saved such that the data can be interpreted 
as a Poisson sampling from the time-continuous forward projected image sequence. 
If one works with rebinned and gated data, the discrete modeling in time is more 
appropriate. One only obtains information about K time intervals, and the model of the 
image at time $t_i$ instead becomes 

\[ u(x,t_i) = \int_{t_{i-1}}^{t_i} \big( w_0(x)\, I_0(t) + w_1(x)\, u_1(x,t) \big) \, dt. \tag{4.83} \]

Similar problems arise in dynamic SPECT and MR ([3, 50]). 

At first glance, it is natural to solve for the parameters in (4.74), (4.78), (4.79) in 
the framework of a nonlinear parameter identification problem. On the other hand, 
using prior information on possible values of P, it is often possible to partially 
discretize the parameters, and since (4.79) is usually a simple system of linear ODEs, 
it allows for explicit solutions in many cases. Using these explicit solutions for a 
discrete set of parameters is the basis for rewriting the identification as a basis pursuit 
problem, which we discuss as an alternative approach in Section 6.3. 
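
The remark on explicit solutions can be made concrete for the one-compartment model in a single pixel. The following minimal sketch assumes a constant arterial input $I_0$, a zero initial tissue concentration, and illustrative parameter values; none of these are taken from the text. It compares the closed-form solution of (4.82) with a plain forward Euler integration and then combines the result into the measured activity according to (4.80) and (4.81). All function names are hypothetical.

```python
import math

def tissue_concentration_explicit(t, p1, p2, I0):
    """Closed-form solution of u1' = -p1*u1 + p2*I0 with u1(0) = 0 (constant input)."""
    return p2 * I0 / p1 * (1.0 - math.exp(-p1 * t))

def tissue_concentration_euler(T, steps, p1, p2, input_fn):
    """Forward Euler integration of u1' = -p1*u1 + p2*I(t) on [0, T], u1(0) = 0."""
    dt = T / steps
    u1 = 0.0
    for n in range(steps):
        u1 += dt * (-p1 * u1 + p2 * input_fn(n * dt))
    return u1

def activity(u1, input_value, p3, chi=1.0):
    """Combine input and tissue concentration with the weights w0 and w1."""
    w0 = chi * p3
    w1 = chi * (1.0 - p3)
    return w0 * input_value + w1 * u1

# Illustrative parameter values for one pixel inside the region chi = 1
p1, p2, p3, I0, T = 0.5, 0.8, 0.2, 1.0, 4.0

u1_exact = tissue_concentration_explicit(T, p1, p2, I0)
u1_euler = tissue_concentration_euler(T, 4000, p1, p2, lambda t: I0)
u_pixel = activity(u1_exact, I0, p3)
```

For time-dependent inputs, the closed form becomes a convolution of the input with an exponential kernel, which is the structure exploited by the basis pursuit formulation.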


6.2 Parameter identification 


The variational formulation of the nonlinear inversion as a parameter identification 
problem is rather straightforward. In the case of noisy data, we can minimize a 
combination of the log-likelihood with regularization functionals R acting on the 
parameters P, i.e. 

\[ \lambda \int_0^T L(f(t) \,|\, Ku(t)) \, dt + R(P) \tag{4.84} \]

subject to (4.78), (4.79). Standard priors for the parameters again lead to spatial 
smoothness, possibly with edges, such that Dirichlet energies or total variation are 
useful choices. Several authors have used such approaches in studies in emission 
tomography ([13, 58, 105]), and the numerical results confirm significant improvements 
in the reconstructions. 

The two major questions related to analysis and numerical solution are the fol- 
lowing: 

• Analysis: Provide estimates (in dependence on the noise level) confirming and 
quantifying the gain of quality in reconstructions when using the nonlinear 
inversion scheme instead of linear reconstructions of u with subsequent parameter 
estimation in every point x. 

• Numerical Solution: Construct numerical schemes to solve the inverse problem 
efficiently in three spatial and one time dimension. 

So far, the first issue is completely open. Although several approaches to error 
estimation for regularization methods for nonlinear inverse problems exist, the 
application to the above problem is not straightforward due to the combination of the 
spatial operator K and the time dynamics. However, an even more severe issue is the 
comparison to the simpler approach of first reconstructing an image sequence and 
then estimating parameters, for which no advanced concepts in inverse problems are 
available. The study of such questions, possibly also combined with statistical noise 
models as in emission tomography, is nevertheless highly relevant for future research. 

With respect to the efficient numerical solution, further advances have been made 
recently. In designing computational algorithms, several goals and also limitations 
have to be taken into account. First of all, the operator K, respectively its discretiza- 
tion as a matrix, is rather complex and can usually neither be stored nor inverted effi- 
ciently. Thus, an algorithm for solving the inverse problem should be based mainly on 
the application of K and its adjoint K* instead of solving large linear systems includ- 
ing K. Secondly, the problem dimension will be huge if space- and time-dependence 
are taken into account simultaneously. Thus, it seems more appropriate to use split- 
ting algorithms which can iterate in an alternating way between an image reconstruc- 
tion and a parameter identification step. A further complication can arise due to the 
spatial regularization on the parameters which additionally couples the parameter 
estimation step and might enforce further splitting. 

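In order to highlight the structure and couplings, let us derive the first-order 
optimality conditions for the inverse problem in a constrained formulation. Thus, we look 
for saddle points of the Lagrangian 

\[
\begin{aligned}
L(u,U,P;v,w) ={}& \lambda \int_0^T L(f(t) \,|\, Ku(t)) \, dt + R(P) \\
&+ \int_0^T \int_\Omega \Big( u(x,t) - \sum_{j=1}^{k} w_j(x,P(x))\, u_j(x,t) - w_0(x,P(x))\, I(t) \Big)\, v(x,t) \, dx \, dt \\
&+ \int_0^T \int_\Omega \big( \partial_t U(x,t) - A(x,P(x))\, U(x,t) - B(x,P(x))\, I(t) \big) \cdot w(x,t) \, dx \, dt .
\end{aligned}
\]

First-order optimality is given by vanishing first derivatives of the Lagrangian, i.e. 

\[
\begin{aligned}
0 &= \partial_u L = \lambda K^* \partial_{Ku} L(f(t) \,|\, Ku(t)) + v(t), \\
0 &= \partial_U L = -\mathbf{w}\, v - \partial_t w - A^T w, \\
0 &= \partial_P L = -\int_0^T \Big( \Big( \sum_{j=1}^{k} \partial_P w_j\, u_j + \partial_P w_0\, I \Big) v + (\partial_P A\, U + \partial_P B\, I) \cdot w \Big) \, dt + R'(P),
\end{aligned}
\]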

where $\mathbf{w} = (w_1(x,P(x)), \ldots, w_k(x,P(x)))$ is the vector of weights. One observes 
that the way we introduced constraints naturally separates the image reconstruction 
and the parameter identification steps: The optimality with respect to u can be 
considered as an image reconstruction problem for u at each time step t. The forward 
equation for U, together with the adjoint equation arising from the optimality with 
respect to U and the optimality with respect to P, constitutes a parameter identification 
problem for an ordinary differential equation in a Banach space. Thus, in algorithms, it is 
natural to split these two subproblems, e.g. by augmented Lagrangian methods (ADMM). 
If done appropriately, this usually permits one to use available methods for static 
image reconstruction at each time step in the first part. For the parameter identification, 
one observes that all parts, except potentially R'(P), are purely local in each pixel, 
and hence one obtains systems of decoupled ODEs in each pixel, which can be solved 
efficiently and parallelized in a trivial way. If R' is local, as for $L^p$-penalties, then one 
directly obtains algorithms with reasonable efficiency this way. If R' is a differential 
operator in space, as in regularization with total variation or Sobolev norms, then 
a further splitting based on doubling the parameter, i.e. a novel constraint Q = P, 
seems reasonable. If the splitting is performed such that P appears in R'(P), but the 
coupling to U and w is via Q, one again obtains efficient algorithms, leaving the 
ODEs local in space. 


6.3 Basis pursuit 


As an alternative approach that receives increasing attention in emission tomography, 
we consider a basis pursuit solution, which we discuss for simplicity in the special 
case of the single compartment model (4.82), which we rewrite for simpler notation 
as 

\[ \partial_t v(x,t) = -a(x)\, v(x,t) + b(x)\, I(t), \tag{4.85} \]

subject to the initial condition v(x,0) = 0 (usually modeling injection of the tracer at 
time 0) and overall concentration given by 

\[ u(x,t) = c(x)\, v(x,t) + (1 - c(x))\, I(t). \tag{4.86} \]

The differential equation can be solved easily, yielding 

\[ u(x,t) = c(x)\, b(x) \int_0^t e^{-a(x)(t-s)}\, I(s) \, ds + (1 - c(x))\, I(t). \tag{4.87} \]

Now, a further key step is to discretize the parameter a into a set of possible values 
$a_1, \ldots, a_N \in \mathbb{R}_+$. Thus, we can write the image as 

\[ u(x,t) = \sum_{j=0}^{N} \alpha_j(x)\, \varphi_j(t), \tag{4.88} \]

with unknown coefficients $\alpha_j(x)$ and time basis functions 

\[ \varphi_0(t) = I(t), \qquad \varphi_j(t) = \int_0^t e^{-a_j (t-s)}\, I(s) \, ds, \tag{4.89} \]

which can be precomputed. We need to keep in mind, however, that (4.88) is equivalent 
to the previous form only if the following conditions are met for each $x \in \Omega$: 

\[ \alpha_j(x) \geq 0, \qquad \alpha_0(x) \leq 1, \qquad \|(\alpha_1(x), \ldots, \alpha_N(x))\|_0 = 1. \tag{4.90} \]

If these conditions are met, we can reconstruct the parameters in the original model 
via 

\[ a(x) = a_{J(x)}, \qquad b(x) = \frac{\alpha_{J(x)}(x)}{1 - \alpha_0(x)}, \qquad c(x) = 1 - \alpha_0(x), \tag{4.91} \]

where J(x) is the index such that $\alpha_{J(x)}(x) \neq 0$. 
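
The expansion (4.88) and the recovery formulas (4.91) can be tried out in a single pixel. The sketch below precomputes the time basis functions by the trapezoidal rule and then fits noise-free synthetic data. Note that it does not use a convex relaxation: since exactly one of the coefficients $\alpha_1, \ldots, \alpha_N$ may be nonzero, it simply tries every candidate rate by least squares and keeps the best one, and it ignores the sign constraints in (4.90). The input function, the rate grid, and the function names are assumptions made for this example.

```python
import numpy as np

def time_basis(t, input_vals, rates):
    """Rows: phi_0 = I, phi_j(t) = int_0^t exp(-a_j (t - s)) I(s) ds (trapezoidal rule)."""
    dt = t[1] - t[0]
    basis = [input_vals.copy()]
    for a in rates:
        phi = np.zeros_like(t)
        for n in range(1, t.size):
            integrand = np.exp(-a * (t[n] - t[: n + 1])) * input_vals[: n + 1]
            phi[n] = dt * (integrand.sum() - 0.5 * (integrand[0] + integrand[-1]))
        basis.append(phi)
    return np.array(basis)  # shape (N + 1, len(t))

def fit_one_sparse(u, basis):
    """Try each single atom j >= 1 together with phi_0; keep the smallest residual."""
    best = None
    for j in range(1, basis.shape[0]):
        A = np.stack([basis[0], basis[j]], axis=1)
        coef, *_ = np.linalg.lstsq(A, u, rcond=None)
        residual = np.linalg.norm(A @ coef - u)
        if best is None or residual < best[0]:
            best = (residual, j, coef)
    _, j, (alpha0, alphaJ) = best
    return j, alpha0, alphaJ

# Synthetic single-pixel experiment (illustrative values)
t = np.linspace(0.0, 10.0, 201)
input_vals = t * np.exp(-t)            # hypothetical input function I(t)
rates = [0.1, 0.3, 0.6, 1.0, 2.0]      # candidate values a_1, ..., a_N
basis = time_basis(t, input_vals, rates)

a_true, b_true, c_true = 0.6, 0.9, 0.7
u = c_true * b_true * basis[3] + (1.0 - c_true) * basis[0]  # a_3 = 0.6

J, alpha0, alphaJ = fit_one_sparse(u, basis)
a_rec = rates[J - 1]
c_rec = 1.0 - alpha0
b_rec = alphaJ / (1.0 - alpha0)
```

Such an exhaustive search is only feasible pixel by pixel, i.e. when the image sequence u is already available. Once the coefficients of all pixels are coupled through the forward operator K, it is no longer practical.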

A particularly attractive feature of the basis pursuit formulation is that the forward 
model is now linear and has some separation of spatial and temporal features, 
i.e. 

\[ Ku(t) = \sum_{j=0}^{N} (K\alpha_j)\, \varphi_j(t). \tag{4.92} \]

The major challenges, as usual in basis pursuit, come from the sparsity constraint 
in (4.90). One heuristic approach is to ignore the constraint and consider nonsparse 
decompositions or subsequent thresholding (cf., e.g. [80]). For low data quality, 
however, one loses the advantages of the modeling approach and the reconstructions 
can become rather arbitrary. An alternative is to investigate convex relaxations, as 
frequently used in compressed sensing. This means that the sparsity constraint is 
usually formulated as a penalty (respectively regularization) and then relaxed from 
the nonconvex $\ell^0$- to the convex $\ell^1$-norm. The case of coefficient vectors with only one 
nonzero entry is usually the easiest one to deal with in compressed sensing. Exact 
reconstruction is possible, even with some data noise, if the basis functions $\varphi_j$ are 
normalized in a Hilbert space scalar product ([51]), which is easy to achieve (e.g. by 
redefining the coefficients). 
Unfortunately the latter argument does not apply directly to the inverse problem
in (4.92) since one has to solve many inverse problems with sparsity constraints for
every x, which are coupled by the operator K. The appropriate sparsity prior is thus
of the form

\|\alpha\|_{0,\infty} = \sup_{x \in \Omega} \|\alpha(x)\|_{\ell^0}, (4.93)

where α(x) is the vector of coefficients. A convex relaxation is given by

\|\alpha\|_{1,\infty} = \sup_{x \in \Omega} \|\alpha(x)\|_{\ell^1}, (4.94)

which motivates one to further study problems of the form

\lambda \int_0^T L(f, Ku)\,dt + \|\alpha\|_{1,\infty}, (4.95)

subject to (4.88).


6.4 Motion and deformation models 


A particularly important process in many applications is motion, and thus also its 
modeling receives growing attention in image reconstruction. There are two main as- 
pects of motion in imaging: It can either simply cause disturbances of the images, 
e.g. as motion blur or by motion of the imaged subject between two time frames, or it 
can be the process of interest itself, e.g. in quantifying flow behavior. In any case, it 
is important to use appropriate models for motion, namely, deformations introduced 
by them. Using motion models, the problem is related to classical motion estimation 
in image sequences, e.g. via the celebrated optical flow ([5, 54]). Using deformation 
models between different time frames is related to image registration or fusion ([67]). 
Let us start with motion models corresponding to flow dynamics. If the image is 
modeled via its evolving density in three spatial dimensions (e.g. tracers in fluores- 
cence microscopy, emission tomography, or MR), then it is appropriately modeled via 

the transport equation 
\partial_t u + \nabla \cdot (V u) = 0 (4.96)


in Ω × [0, T], where V is a velocity vector field to be determined. We mention that
in the case of an incompressible substance, the standard relation ∇ · V = 0 holds,
which reduces the degrees of freedom. 
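
To make the role of the conservative form of (4.96) concrete, the following sketch advances a one-dimensional density with a first-order upwind finite-volume step. The periodic grid, the face-centered velocity array `V_face`, and the function names are assumptions of this illustration and do not come from the text; the point is only that a flux-form discretization preserves the total mass of u, as the continuous equation does.

```python
import numpy as np


def upwind_step(u, V_face, dx, dt):
    """One conservative upwind step for u_t + (V u)_x = 0, periodic in x.

    u      : cell averages u_i
    V_face : velocities at the right cell faces x_{i+1/2}
    """
    u_right = np.roll(u, -1)  # u_{i+1}
    # upwind flux at face i+1/2: take the value from the side the flow comes from
    flux = np.where(V_face > 0, V_face * u, V_face * u_right)
    # flux difference F_{i+1/2} - F_{i-1/2}
    return u - dt / dx * (flux - np.roll(flux, 1))


def transport(u0, V_face, dx, dt, steps):
    """Apply `steps` upwind steps to the initial density u0."""
    u = u0.copy()
    for _ in range(steps):
        u = upwind_step(u, V_face, dx, dt)
    return u
```

Under the CFL restriction dt · max|V| ≤ dx, this scheme also keeps a nonnegative density nonnegative when V does not change sign, which is the discrete counterpart of transporting a tracer concentration.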

A variational reconstruction scheme including the motion model is then of the
form

\lambda \int_0^T D(Ku(t), f(t))\,dt + \int_0^T R(u(t), V(t))\,dt \to \min_{u,V}, (4.97)

subject to (4.96), where R is a regularization functional for density and velocity. Such
a formulation is related to several classical problems, e.g. the fluid-dynamic formula- 
tion of optimal transport ([9]) or optimal control formulations of optical flow ([18, 75]) 
in the incompressible case. At first glance, the use of (4.96) seems an unnecessary 
complication in the problem since it increases the degrees of freedom from a scalar 
function to a vector field and it makes the overall reconstruction nonlinear due to the 
bilinear constraint in u and V. However, the transport formulation yields important 
advantages: First of all, the correlation between different time steps is appropriately 
modeled and using regularization functionals that prevent overly large velocities, it 
is possible to coherently follow the motion. Moreover, by determining u and V, one 
obtains a quantification of the flow together with the image reconstruction. 

In [17], a reconstruction approach using optimal transport regularizers has been 
developed using 


R(u,V) = \int_\Omega u\,|V|^2\,dx + \int_\Omega |\nabla u|^p\,dx (4.98)


with particular focus on the total variation case p = 1. For p > 1, the existence of
a minimizer can be shown and the minimization is convex for standard choices of L. 
An alternative to flow models are deformation models which rather correspond to 
the usual Lagrangian approach in solid mechanics. In this case, we use a deformation 
y: Ω × [0,T] → R^3 instead of the velocity field V and obtain a solution of (4.96) as


u(x,t) = u_0(y(x,t)) \det(\nabla y(x,t)), (4.99)


where u_0 = u(·, 0) is the initial value. The variational reconstruction scheme in this
case can be formulated as 


\lambda \int_0^T D(Ku(t), f(t))\,dt + \int_0^T R(u(t), y(t))\,dt \to \min_{u_0, y} (4.100)


with u given by (4.99). Since u is given by an explicit formula in terms of u_0 and y,
the minimization can be carried out with respect to the latter two variables. Such 
an approach was taken by [65] for PET. We also refer to [91] for a recent study with 
hyperelastic regularization of the deformation. 

Note that the main difference in the properties of minimizers in the Eulerian (4.97) 
and Lagrangian (4.100) approaches comes from the way regularization and thus pri- 
or knowledge is introduced. In the Eulerian approach, velocities are penalized over 
time, and hence the goal is an efficient flow as in fluids. In the Lagrangian approach, 
deformations are penalized, e.g. by elastic or hyperelastic energies, which rather cor- 
respond to typical situations in solids. 
6.5 Advanced PDE models


So far, advanced models such as the Navier-Stokes equations for fluids or
reaction-diffusion systems are rarely used, although such models are available in
several cases. The reasons for not using such models in imaging are twofold: First of all, the
computational complexity of inverse problems strongly increases with the complexi- 
ty of the forward models and it is often not clear if the results can be so significantly 
improved that it justifies a strong increase in computation time. The second reason 
is that including more complex models also potentially increases the model uncer- 
tainty. The reason is that with each new model part, additional parameters and mod- 
eling assumptions are introduced. Take, as a simple example, the quantification of 
intracellular fluid flow from 4D fluorescence microscopy data. Standard flow estima- 
tion algorithms simply use the transport equation for the density of the fluorescence 
tracer with some regularization on the velocity field. One could, however, use the in- 
compressible Stokes model for the fluid flow and estimate the force field, for which 
good prior knowledge is available. However, using this advanced model, one needs 
a further assumption of incompressibility and introduces the viscosity as a further 
uncertain parameter, and hence the overall uncertainty of the forward model is in- 
creased. 

Despite these issues, there is still growing interest in using advanced PDE models
in several fields of image reconstruction and analysis (cf., e.g. [45, 52, 53]), and a large 
amount of research in this direction is to be expected in the next years. 
Abstract: In the past two decades, regularization methods based on the ℓ1 norm,
including sparse wavelet representations and total variation, have become immensely
popular. So much so, that we were led to consider the question whether ℓ1-based
techniques ought to altogether replace the simpler, faster and better known ℓ2-based
alternatives as the default approach to regularization techniques.

The occasionally tremendous advances of ℓ1-based techniques are not in doubt.
However, such techniques also have their limitations. This article explores
advantages and disadvantages compared to ℓ2-based techniques using several practical
case studies. Taking into account the considerable added hardship in calculating
solutions of the resulting computational problems, ℓ1-based techniques must offer
substantial advantages to be worthwhile. In this light, our results suggest that in many
applications, though not all, ℓ2-based recovery may still be preferred.
1 Introduction


Ill-posed problems typically require some regularization in order to compute a credi- 
ble approximate solution in a stable, well-defined manner. In this article, we consider 
such problems where the objective is to recover a function u(x), with x ∈ Ω ⊂ R^d
(typically d = 2 or d = 3), from observed and discrete data b. Given is a forward
operator, F(u), which predicts data for any suitable function u, and the challenge is
to find u such that the predicted data match the observed data to within a reasonable 
tolerance. 
It is convenient for our discussion at this point to consider a linear forward
operator, with u discretized on some mesh in Ω and reshaped as a vector of unknowns u,
and with the observed and predicted data likewise written as b and F(u) = Ju,
respectively. Here, J is an m × n sensitivity matrix, m < n, which often has a nontrivial
null space. Then, we write down the Tikhonov-type regularized problem [25, 60, 61]

\min_u \; \frac{1}{2} \|Ju - b\|_2^2 + \beta R(u), (5.1)

where ‖·‖_2 denotes the usual vector ℓ2 norm, β > 0 is a parameter, and R is a
regularization operator. We focus on the following possibilities for R:
(1) Consider

R(u) = \frac{1}{p} \|Wu\|_p^p, (5.2)

for the choices p = 1 (referred to as L1) or p = 2 (referred to as L2). Here, W is an n × n
weight matrix, e.g. some wavelet or curvelet transform, or just the identity [10, 24, 33].
For notational purposes, we stipulate that W is not a discretized gradient operator.¹
(2) Recalling that u represents a discretization of a function u(x) on Ω, choose R(u)
to be an appropriate discretization of

R(u) = \frac{1}{p} \int_\Omega |\nabla u|^p, (5.3)

again considering the cases p = 2 or p = 1. The case p = 2 leads to a discretization
of the Laplacian operator on Ω when considering necessary conditions for the
minimization (5.1): denote this by L2G. The case p = 1 leads to total variation [51, 53]:
denote this by L1G.²
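
For p = 2, the necessary conditions of (5.1) with (5.2) reduce to the linear system (JᵀJ + β WᵀW) u = Jᵀb, and the L2G case has the same form with W replaced by a discrete gradient. The following is a minimal dense-matrix sketch of this, assuming a one-dimensional forward-difference gradient; the names `tikhonov_l2` and `gradient_matrix` are introduced here for illustration only, and a large problem would use sparse or iterative solvers instead.

```python
import numpy as np


def gradient_matrix(n, h=1.0):
    """Forward-difference matrix of shape (n - 1, n): (D u)_i = (u_{i+1} - u_i) / h."""
    return (np.eye(n, k=1)[:-1] - np.eye(n)[:-1]) / h


def tikhonov_l2(J, b, beta, W=None):
    """Solve min_u 0.5 * ||J u - b||^2 + 0.5 * beta * ||W u||^2 via the normal equations.

    W = None gives the L2 case with W = I; passing a gradient matrix gives L2G.
    """
    n = J.shape[1]
    if W is None:
        W = np.eye(n)
    return np.linalg.solve(J.T @ J + beta * (W.T @ W), J.T @ b)
```

In the L2G case the system matrix is nonsingular as long as constant vectors are not in the null space of J, since constants span the null space of the difference matrix.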


For many years, the almost automatic choices of regularization in (5.2) and (5.3) 
have been based on the ℓ2-norm, i.e. p = 2. This yields a straightforward linear least
squares problem that can be effectively solved even when the problem is very large
(see, e.g. [32, 55]). Large computational problems are manageable even if F is
nonlinear in u, and R is more complex but still ℓ2-based (see, e.g. [16, 17, 29]). Furthermore,
the ℓ2-based regularization enjoys a favorable statistical interpretation for models


1 Of course, wavelet function bases do approximate derivatives as well. For instance, our distinction
as such is particularly blurred by tight frame wavelets [7]. However, the distinction of L1 from L1G 
should be intuitively clear. Note also that one can always transform L1 and L2 by a change of vari- 
ables into a form where W becomes the identity. However, we retain our notational redundancy for 
convenience. 

2 Note that the gradient magnitude |∇u| is the ℓ2 norm of ∇u. Thus, the L1G expression is one of
a discrete ℓ1 norm only if d = 1. Also, a further regularization is required when using L1G upon
considering necessary conditions for (5.1); see, e.g. [1]. 
with a prior that is normally distributed [8, 41, 58, 61]. In the past two decades,
however, regularization methods based on the ℓ1-norm (i.e. p = 1 in (5.2) and (5.3)) have
become immensely popular; see, e.g. the books [24, 46, 51]. In fact, we have been led
to consider the idea that ℓ1-based techniques should altogether replace the simpler,
faster and better known ℓ2-based alternatives. There are two essential motivations for
this exciting trend.

• It is natural to choose for the regularization term a penalty function as in (5.3),
thus expressing the a priori information that u(x) ought to be smooth. However, if
u(x) has jump discontinuities, then using L2G essentially smears out such
discontinuities because the Dirac δ-function is not square integrable. On the other hand, the
δ-function is integrable, and thus using L1G better accommodates jump
discontinuities.

• Whether the term R is aimed at penalizing the magnitude of the gradient or the
solution itself, the ℓ1-based regularization tends to produce sparse approximations.
In the L1G context, this is expressed in the observation that the reconstruction tends
toward being piecewise constant, so the gradient is mostly zero and thus sparse. In
the L1 wavelet (or DCT) approximation context, where Wu in (5.2) corresponds to
coefficients of different wavelet (or cosine) basis functions, a compressed approximation
involving only a few basis functions often results (unlike the case when using p = 2).
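
The sparsity mechanism in the second item is easiest to see in the simplest special case J = I and W = I, where (5.1) decouples into scalar problems. For p = 1 each component solves min ½(u − b)² + β|u|, whose solution is soft thresholding; for p = 2 it solves min ½(u − b)² + (β/2)u², which merely rescales b. The sketch below assumes exactly this special case, and its function names are chosen here for illustration.

```python
import numpy as np


def soft_threshold(b, beta):
    """Componentwise minimizer of 0.5 * (u - b)^2 + beta * |u| (the L1 case)."""
    return np.sign(b) * np.maximum(np.abs(b) - beta, 0.0)


def l2_shrink(b, beta):
    """Componentwise minimizer of 0.5 * (u - b)^2 + 0.5 * beta * u^2 (the L2 case)."""
    return b / (1.0 + beta)
```

Every entry of b with magnitude at most β is set exactly to zero by `soft_threshold`, whereas `l2_shrink` leaves all nonzero entries nonzero.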


The rather fundamental importance of the above two reasons for using p = 1 is 
not in doubt. Among many other researchers, we have contributed to this volume 
of work [1, 30, 36]. We have found that for well-conditioned problems with sufficient
high-quality data,³ ℓ1-based regularization can, in many cases, “deliver on its
promise.” However, for problems with poor data, or ill-conditioned problems typi- 
cally resulting from discretizations of highly ill-posed problems, we have found that 
this is often not the case. To demonstrate and motivate the ensuing discussion, let us 
consider the following example. 


Example 5.1 (Image Deblurring). Let J be a discretization of a known image blurring 
operator and u be an image reshaped into a vector. Our goal is to recover the clean im- 
age given noisy blurred data. For the following numerical experiments, we have used 
three codes: (i) RestoreTools [33], which employs an L2-type recovery strategy (viz. 
p = 2 and W = I in (5.2)); (ii) the GPSR package [26], which employs a wavelet L1
recovery algorithm; and (iii) a straightforward total variation (L1G) code. The above two
packages, in our opinion, are both excellent representations of good software for the 
problems they aim to solve. However, the L2 code requires, comparatively speaking, 


3 We further explain in Section 3 what we mean by the intuitive terms “high-quality” versus “poor” 
data. 
Figure 5.1: The ground truth image (a) is blurred and corrupted by noise to create the data (b).
Recovered solutions obtained for this data by RestoreTools (L2), GPSR (L1) and total variation (L1G) 
are displayed in (c-e), respectively. 


only a small fraction of computational time to terminate successfully, and hence it is 
to be preferred unless the L1 reconstructions are demonstrably better. 

The “true image,” or ground truth, is a 128 × 128 MRI image from Matlab’s
collection. The blurring kernel is e^{-|x|^2/2σ} with σ = 0.01 and the blurred data is further
corrupted by 1 % white noise. In all three methods, the data is fit to an accuracy of 1 %
by tuning the regularization parameter β (see, e.g. [61]). The results are presented in
Figure 5.1. 

It is apparent that, at least for this problem, the ℓ1-based reconstructions do not
yield more pleasing results than the simple ℓ2-based one. The L1G image is typically
blocky, and in the present context, it may be considered the worst of the three: indeed,
sparsity of the surface gradient is not a good regularization objective here. The first
two recoveries are more comparable in terms of quality. In fact, it may be argued that
the L2 result is altogether better than the ℓ1-based ones.


Image deblurring is a favorite application in the literature for discussing and
comparing both L1 and L1G techniques. Indeed, in many such examples, ℓ1-based
regularization is to be preferred (see, e.g. [11, 24, 36]). However, Example 5.1 is by no means
esoteric. Furthermore, similar comparative observations arise when working on cer- 
tain nonlinear ill-posed problems such as electrical impedance tomography (EIT) and 
direct current (DC) resistivity [1]; we return to this in Section 4. 
The goals of this paper are therefore to explore, bearing in mind the
occasionally impressive advances of ℓ1-based regularization techniques, also some of their
limitations. Taking into account the often considerable added hardship in
calculating solutions of the resulting computational problems, ℓ1-based techniques must
offer substantial advantages to be worthwhile. In this light, our results suggest that in
many applications, ℓ2-based recovery may be preferred. To this end, we provide the
following cautionary notes:

(1) Only the left term in the objective function of (5.1) is really mandated by the stat- 
ed data fitting problem. The choice of regularization is discretionary: different 
choices may generally yield different solutions that as such must all be consid- 
ered acceptable. The further specification of regularization reflects a prior which 
depends on additional knowledge that may or may not be truly available. 

(2) It is not true that one must always seek a sparse approximate solution, especially 
if an appropriate basis to span the solution is not known. 

(3) Codes such as those reported in [3, 4, 26, 42], which perform well when applied 
in the context of using wavelets for denoising or deblurring, may occasionally 
perform relatively poorly when applied in a wider context. 

(4) In our experience, if the data is not of sufficiently high quality, in the sense that
there is too much noise, then ℓ1-based methods may occasionally perform worse
than the corresponding ℓ2-based methods.

(5) If the data is not of sufficiently high quality, in the sense that it is too sparse
or rare, then ℓ1-based methods may occasionally perform worse than the
corresponding ℓ2-based methods.

(6) If the computational problem is highly ill-conditioned, then ℓ1-based methods
may occasionally perform worse than the corresponding ℓ2-based methods.


In this paper, we explore examples, or case studies, which demonstrate the
claims above and explain when ℓ2-based methods merit prime consideration. Some
analysis is also provided. We group our discussion into two classes: problems with
poor data, considered in Section 3, and highly ill-conditioned problems, considered
in Section 4. The latter section is far longer and more involved than the others, and
Theorem 5.4, as well as the analysis in Section 4.1, are new. Before these, Section 2
provides a quick review of ℓ1-based regularization. We review the theory and the
requisite assumptions necessary for ℓ1-based recovery to perform well.

Finally, we summarize the paper in Section 5. 


2 ℓ1-based regularization


Several books, e.g. [11, 24, 46, 51, 56], contain descriptions of ℓ1-based regularization
methods in the context mentioned earlier, and it is not our intention to reproduce 
them here. We only touch upon a few items. For early efforts in geophysics and data 
assimilation, see [14, 59]. For advanced uses of such methods in machine learning,
see, e.g. [47, 49]. 

In the context of a discrete cosine or a wavelet-type transform, the problem (5.1) 
may be viewed as a noisy version of the problem 


min R(u) , (5.4a) 
s.t. Ju = b, (5.4b) 


where J has a full row rank m < n. Note that this can be a well-conditioned problem 
for both choices of p in (5.2). For L1 (i.e. p = 1 in (5.2)), problem (5.4) can be cast as 
a linear programming problem, and linear programming theory already guarantees 
that there is an optimal basic feasible solution which is m-sparse (i.e. with only at 
most m nonzero components) [50, 62]. In contrast, when using L2, all components of 
the optimal u are typically nonzero. 
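
The statement about basic solutions can be checked on a toy instance without an LP solver. Writing u = p − q with p, q ≥ 0 turns (5.4) with W = I and p = 1 into a standard-form linear program, whose optimum is attained at a basic feasible solution, i.e. at a vector supported on m linearly independent columns of J. The sketch below simply enumerates all such supports, which is only feasible for very small n and is an illustrative device chosen here, not a practical algorithm; the function names are hypothetical.

```python
import itertools

import numpy as np


def best_basic_solution(J, b):
    """Minimum-l1 solution of J u = b among solutions supported on m columns of J."""
    m, n = J.shape
    best, best_norm = None, np.inf
    for cols in itertools.combinations(range(n), m):
        cols = list(cols)
        try:
            coeffs = np.linalg.solve(J[:, cols], b)
        except np.linalg.LinAlgError:
            continue  # these m columns are linearly dependent
        norm1 = np.abs(coeffs).sum()
        if norm1 < best_norm:
            u = np.zeros(n)
            u[cols] = coeffs
            best, best_norm = u, norm1
    return best


def min_l2_solution(J, b):
    """Minimum-l2 solution of J u = b for J with full row rank."""
    return J.T @ np.linalg.solve(J @ J.T, b)
```

On a random 3 × 8 instance, the first function returns a vector with at most three nonzero entries whose ℓ1 norm is no larger than that of the minimum ℓ2 solution, while the latter is generically fully populated.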

This has been well known since at least the 1960s. Moreover, though, since the 
above transforms utilize elaborate basis functions, it is reasonable to expect that 
much fewer than m basis functions may suffice, corresponding to a much sparser 
solution. The discovery [13, 20, 22] that using L1 often yields such a sparse solution, 
effectively solving a very hard combinatorial problem, is much newer and constitutes 
a major breakthrough. 

However, it is not always the case that the solution of the constrained
optimization problem using the ℓ1 norm yields a sparse solution. Furthermore, for (5.1) in
general, it does not automatically follow that if such a sparse solution exists, it is
an appropriate estimate of the true solution, see [19] and Section 4.1.

Much effort has been devoted to the question of under what conditions the
ℓ1 solution of (5.4) produces the sparsest possible solution of (5.4b), referred to as the
ℓ0 solution. Of course, a more practical goal would probably be to seek a “sufficiently
sparse” solution, but the quest for optimum in this regard sheds light on what is
required more generally. The restricted isometry property (RIP) [9] and the null space
property of [15, 21] both provide sufficient conditions, whereas the y-condition of [40]
is both necessary and sufficient for obtaining the sparsest solution by L1.

These conditions are of great value for understanding the design of compressed 
sensing methods. Unfortunately, though, for realistic instances of the matrix J, they 
are generally intractable (NP-hard) to verify numerically. Moreover, in Section 4.1, we 
show that such conditions are violated for a specific case of the inverse potential
problem when attempting to recover a pair of point charges by ℓ1-based methods.

The vector norm function ‖·‖_p is well known to be convex only when p ≥ 1.
Thus, ℓ1 is marginally convex. Even more sparsity-inducing is the use of a nonconvex
norm with 0 < p < 1 [12, 44, 54]. However, there is a price to pay for lack of
convexity, in terms of both poorer theory and the necessity of convergent algorithms
which typically apply a continuation (homotopy) procedure starting from a convex
ℓp-norm.
Several famous codes cited earlier for solving (5.4) use methods that are based
on gradient projection with acceleration (see, for instance, the extended Chapter 6 
of [5] and references therein). The advantage of these methods is that they extend 
directly to problems with nonsmooth constraints and require the objective function 
gradient to be only Lipschitz continuous. However, bear in mind that for solving sim- 
ple unconstrained convex quadratic problems, such methods boil down to acceler- 
ated gradient descent without preconditioning, generally thought to be unforgivably 
slow. These methods seem to work well for compressed sensing problems because the 
corresponding problems (5.4) are well-conditioned in an appropriate sense. Unfortu- 
nately, other applications involving, for instance, PDE-constrained optimization (as 
in Section 4), are highly ill-conditioned and therefore, similar numerical optimiza- 
tion methods should not be expected to be robust and efficient in the latter context. 

Total variation (L1G) was discovered and peaked earlier than sparse wavelet
basis reconstruction and compressed sensing. The books [11, 51, 61] and many papers
develop both theory and algorithms using this approach. In practice, some
regularization such as a Huber switching function [56] is often used, and this really gives
a mix of ℓ1 with ℓ2 elements while still retaining the L1G spirit [1]. See also [6] for
another approach to round excessive L1G sharpness. Moreover, one popular iterative
scheme to carry out the resulting algorithm is lagged diffusivity, which is a special
case of iteratively reweighted least squares (IRLS) [1, 61].
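
A minimal one-dimensional sketch of lagged diffusivity is given below. It is written under assumptions that go beyond the text: instead of a Huber switching function it smooths |t| by sqrt(t² + ε²), the gradient is a plain forward difference with unit spacing, the iteration starts from the L2G solution, and dense matrices are used. Each step freezes the weights 1/sqrt((Du)² + ε²) at the current iterate and solves a weighted least squares problem, which is the IRLS structure mentioned above.

```python
import numpy as np


def smoothed_l1g_objective(u, J, b, beta, eps):
    """0.5 * ||J u - b||^2 + beta * sum_i sqrt((u_{i+1} - u_i)^2 + eps^2)."""
    du = np.diff(u)
    return 0.5 * np.sum((J @ u - b) ** 2) + beta * np.sum(np.sqrt(du ** 2 + eps ** 2))


def lagged_diffusivity(J, b, beta, eps=1e-3, iters=30):
    """IRLS / lagged diffusivity for the smoothed 1D L1G problem (illustrative sketch)."""
    n = J.shape[1]
    D = np.diff(np.eye(n), axis=0)  # forward differences, shape (n - 1, n)
    u = np.linalg.solve(J.T @ J + beta * (D.T @ D), J.T @ b)  # L2G starting guess
    history = [smoothed_l1g_objective(u, J, b, beta, eps)]
    for _ in range(iters):
        w = 1.0 / np.sqrt((D @ u) ** 2 + eps ** 2)  # lagged diffusivity weights
        u = np.linalg.solve(J.T @ J + beta * ((D.T * w) @ D), J.T @ b)
        history.append(smoothed_l1g_objective(u, J, b, beta, eps))
    return u, history
```

Because the reweighted quadratic majorizes the smoothed objective at the current iterate, the recorded objective values do not increase from one iteration to the next.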

Unlike the case for wavelet-type solutions, where a sparse representation is 
sought for the same high-quality surface or image approximation, here the regu- 
larization is applied directly to the surface variables to be recovered. Along with the 
advantage in directly penalizing piecewise smoothness, the tendency of the L1G reg- 
ularization to give sparse gradients, translating into a “blocky image,” is not always 
what one necessarily wants (see, e.g. Figure 5.1(e)). L1G penalizes large jumps in the
solution more than small jumps, and this may introduce distortion in the reconstruct- 
ed surface. Various nonconvex alternatives to L1G are listed in [56], for instance, and 
these occasionally yield sharper results for some applications. However, the non- 
convex nature of these regularizations again leads to both theoretical and practical 
additional difficulties. 

Our focus in this article is on exploring situations where use of the L1 or L1G reg- 
ularization (p = 1 in (5.2), (5.3)) may reasonably be compared to use of L2 or L2G 
(p = 2 in (5.2), (5.3)). Therefore, employing any of the even sharper nonconvex op- 
tions mentioned above is not under further consideration. 

The above synopsis has been restricted to linear problems. There is very little 
ℓ1 theory for nonlinear problems. Moreover, it is easy to see that some of the basic
sparsity arguments fail for this case. Consider the problem 


\min_u \|u\|_1
\text{s.t.} \quad F(u) = b,

Figure 5.2: When the constraint (solid) is nonlinear, it does not need to intersect the level set of
‖u‖_1 (dashed) at a vertex, so the solution is not necessarily sparse.


where the forward mapping function F : R^n → R^m is smooth and has significant
curvature (see Figure 5.2). In such a case, the problem need not even have m-sparse 
solutions; indeed, the optimal solution may have n nonzero entries. Thus, the justifi- 
cation of using L1 for nonlinear problems is far from obvious. On the other hand, L1G 
is interesting because of its sharpening property. In Section 4.3, we explore the use of 
L1G for a particular popular nonlinear case study. 
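
A two-variable toy case illustrates the failure of the sparsity argument. Take n = 2, m = 1 and the hypothetical constraint F(u) = (u_1 − 1)² + (u_2 − 1)² = 1/2, chosen here purely for illustration. The constraint set is a circle that stays away from both coordinate axes, so no feasible point is 1-sparse, and the ℓ1 minimizer is the point of the circle closest to the origin along the diagonal. The sketch below finds it by brute force over a fine parametrization of the circle.

```python
import numpy as np


def l1_minimizer_on_circle(center=(1.0, 1.0), radius_sq=0.5, samples=100001):
    """Brute-force min ||u||_1 subject to (u1 - c1)^2 + (u2 - c2)^2 = radius_sq."""
    theta = np.linspace(0.0, 2.0 * np.pi, samples)
    r = np.sqrt(radius_sq)
    pts = np.stack(
        [center[0] + r * np.cos(theta), center[1] + r * np.sin(theta)], axis=1
    )
    # evaluate the l1 norm at every sampled feasible point and keep the smallest
    return pts[np.argmin(np.abs(pts).sum(axis=1))]
```

With the default arguments the minimizer is approximately (0.5, 0.5), so both components are nonzero even though only one scalar equation is imposed.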


3 Poor data 


The perceived quality of a given data set depends on several factors, and not sim- 
ply on some idealized noise level. One of these is the inverse problem operator. For 
instance, in Example 5.1, the deblurring operation, which is essentially to improve 
contrast and sharpness of the image, counters an image smoothing operation which 
aims to remove noise. Thus, a noise level in the data which may otherwise be consid- 
ered benign (say, in a pure denoising application) can be an important obstruction 
here. 

In the context of data fitting, it has been known for decades that ℓ1 data fitting is
more robust than ℓ2 against outliers in the data. See, for instance, [50] and also [45]
for a recent use in the context of 3D graphics. However, such a comparative statement 
does not necessarily hold true for other types of noise such as white noise. 

In general, bearing in mind the additional complications in carrying out ℓ1-based
regularization, the data must be of sufficiently high quality to allow its favorable
properties (when relevant) to be expressed. A common situation yielding lack of
sufficiently good data is when the data is relatively rare, being given only at relatively
few locations in Ω. Let us next discuss a simple example where the data locations are
rare (or sparse) in the domain of definition.


Example 5.2 (Rare data reconstruction of piecewise smooth functions). Consider the
recovery of a (real) signal u*(t) on [0, 1] from m noisy samples u_i ≈ u*(t_i), and
assume we know that u* is piecewise smooth, but may have jump discontinuities. We
discretize the interval [0, 1] with a uniform grid of n = 512 points, and use in a given
experiment a subset of m < n samples taken at random points t_i from this grid. The
integral appearing in (5.3) is discretized using a piecewise linear function u(t) on
all n grid points. Thus, the recovery problem is formulated as in (5.1), with J being
the m × n matrix consisting of m columns forming an identity matrix interspersed
with n − m zero columns. In the limit case of no noise, the formulation (5.4) yields
interpolation through the data points (t_i, u_i) of the sample.

We compare L2G and L1G regularizations. It is easy to verify that in the L2G case, 
these data points are connected by straight lines, whereas with L1G (total variation) 
regularization, the behavior is indeterminate, only restricting u to be monotone. 
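
Both observations can be verified numerically in the noise-free limit. The sketch below minimizes the discrete L2G functional subject to exact interpolation by solving the KKT system, and compares the result with piecewise linear interpolation. It also builds a staircase interpolant to show that two quite different interpolants, each monotone between neighboring samples, have the same total variation. The grid size, sample positions, and function names are assumptions made for this illustration.

```python
import numpy as np


def l2g_interpolate(n, idx, vals):
    """Minimize sum_i (u_{i+1} - u_i)^2 subject to u[idx] = vals via the KKT system."""
    m = len(idx)
    D = np.diff(np.eye(n), axis=0)  # forward differences, shape (n - 1, n)
    P = np.zeros((m, n))
    P[np.arange(m), idx] = 1.0      # sampling operator picking the data locations
    K = np.block([[D.T @ D, P.T], [P, np.zeros((m, m))]])
    rhs = np.concatenate([np.zeros(n), vals])
    return np.linalg.solve(K, rhs)[:n]


def staircase_interpolate(n, idx, vals):
    """Hold each sample value until the next sample (one of many L1G-optimal choices)."""
    k = np.searchsorted(idx, np.arange(n), side="right") - 1
    return vals[np.clip(k, 0, len(idx) - 1)]


def total_variation(u):
    """Discrete total variation sum_i |u_{i+1} - u_i|."""
    return np.abs(np.diff(u)).sum()
```

The L2G minimizer coincides with `np.interp` on the grid, including the constant extension outside the outermost samples, while the staircase attains the same total variation as the piecewise linear interpolant.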

Figures 5.3 and 5.4 depict reconstruction results for m = 9 and m = 28 samples. 
The ground truth signal u* (t) contains two jumps, and we added 5 % Gaussian noise 
to the selected values to form the corresponding data sets. Figure 5.3 shows the result 
for 9 samples, with the regularization parameters tuned by the discrepancy principle 
to obtain a data misfit of 5 ± 0.1 %. There is little difference between the L1G and L2G
reconstructions. 
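
Tuning a regularization parameter to reach a prescribed misfit can be sketched as a one-dimensional root search. The version below is an assumption-laden illustration rather than the procedure used for the figures: it treats the plain L2 case with W = I, measures the misfit relative to ‖b‖, and bisects in log β, relying on the fact that the misfit of the Tikhonov solution grows monotonically with β.

```python
import numpy as np


def relative_misfit(J, b, beta):
    """||J u_beta - b|| / ||b|| for the L2-regularized solution u_beta."""
    n = J.shape[1]
    u = np.linalg.solve(J.T @ J + beta * np.eye(n), J.T @ b)
    return np.linalg.norm(J @ u - b) / np.linalg.norm(b)


def tune_beta(J, b, target, lo=1e-8, hi=1e8, iters=80):
    """Bisection in log(beta) so that the relative misfit matches `target`."""
    for _ in range(iters):
        mid = np.sqrt(lo * hi)  # geometric midpoint of the current bracket
        if relative_misfit(J, b, mid) > target:
            hi = mid            # too much regularization, misfit too large
        else:
            lo = mid
    return np.sqrt(lo * hi)
```

The bracket [lo, hi] must contain the target misfit; for a matrix J with full row rank the misfit ranges from nearly 0 to nearly 1 over the default bracket.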

The reconstruction in Figure 5.4 (a) using 28 samples starts to show the
advantages of L1G. Because the data contains two samples across the right discontinuity,
the regularization parameter β_L2G now had to be decreased to β_L2G = 0.002 in order to
obtain the desired misfit of roughly 5 %. As a result, the L2G reconstruction exhibits
Figure 5.3: Reconstructions of a piecewise smooth function from a few noisy samples: using L2G and
L1G for m = 9 data pairs, with β_L2G = 0.04, β_L1G = 0.08.
Figure 5.4: Reconstructions of a piecewise smooth function from a few noisy samples: using L2G and
L1G for m = 28 data pairs. (a) β_L2G = 0.002, β_L1G = 0.08, (b) β_L2G = 0.02, β_L1G = 0.08


considerable oscillation in the flat sections, although note that the second jump is 
reproduced as well as by the L1G method. In Figure 5.4 (b), we increased β_L2G until
the flat sections became reasonably smooth according to the “eyeball norm.” Observe 
that the oscillation has disappeared, but the second jump is now completely blurred 
as well. 


Example 5.2 illustrates that L1G regularization performs well when there is 
enough quality data to require the reconstructed model to have discontinuities. How- 
ever, when the data is “too sparse”, L2G regularization performs as well as L1G, 
even in the presence of discontinuities in the underlying ground truth function. This 
lesson seems perhaps obvious in hindsight. However, it extends to more complex 
situations where the insight is no longer so obvious. For instance, the problems con- 
sidered in Section 4 have data specified only at the boundary of a given physical 
domain Ω, which is a lower-dimensional manifold; several examples can be found
in the literature where some L1G variant is applied to such problems. For another 
instance, consider a point cloud in 3D, obtained as a set of somewhat noisy and not 
very dense 3D laser scan measurements of a body with edges, such as a desk corner. 
In order to obtain a good surface reconstruction, we need at each point the normal to 
the surface that the (cleaned) point cloud represents [38]. Since the curvature across 
an edge is infinite, the data can be effectively very sparse there, and indeed a global 
ℓ1-reconstruction approach [2] might not work well then. See Figure 10 in [39] for
such an example. Poor data are often encountered in ocean and atmospheric data 
assimilation, as well as in other time-dependent geophysical applications [23, 27]. 
4 Large, highly ill-conditioned problems


In this section, we consider applying ℓ1-based techniques to large, highly
ill-conditioned problems that typically arise in applications involving PDE-constrained
optimization. As a first example, we show in § 4.1 that L1 techniques may not only be
expensive to carry out, but also have difficulty in producing solutions which are as 
sparse as a given ground truth. In § 4.2, we then supply some analytical evidence sup- 
porting this observation. Finally, in § 4.3, we show by another example that while L1G 
is not nearly as severely afflicted as L1, its advantage over L2G in recovering surface 
discontinuities requires favorable conditions to shine through. 


4.1 Inverse potential problem 


In the inverse potential problem, one seeks to recover an electrical source distribution 
in a given domain Ω from measurements of the potential on the domain’s boundary.
This problem arises in EEG source modeling [48] and in electromyography [18, 19]. In 
[19], the sought source is a combination of discrete tripoles corresponding to muscle 
fibers, and as such invites a sparse reconstruction. However, in 3D, the computational 
problem using L1 indeed became much too large and difficult to work with, and our 
eventual success in solving the research problem stated in [19] followed a further re- 
alization that, given the specific goals of those computations, the sparse view was not 
the most effective. This has left the question open regarding what an L1 reconstruc- 
tion can do for such a problem (regardless of cost), a question that we now proceed to 
explore in a more manageable 2D context, with Ω being the unit square.
The forward model

$$ -\Delta v = u(x), \quad x \in \Omega, \tag{5.5} $$


with Neumann boundary conditions on v, predicts the potential v for given electrical 
source u. The total charge must be zero due to these boundary conditions. Note that 
v(x) is only determined up to an overall additive constant, reflecting the physical 
principle that only a potential difference is physically meaningful. 
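
The zero-total-charge statement can be checked with the divergence theorem. The following short derivation is an illustration, assuming homogeneous Neumann data, $\partial v / \partial n = 0$ on $\partial\Omega$ (the text states only "Neumann boundary conditions", so this is an assumption). Integrating (5.5) over Ω gives:

```latex
\int_\Omega u(x)\,dx
  = -\int_\Omega \Delta v\,dx
  = -\int_{\partial\Omega} \frac{\partial v}{\partial n}\,ds
  = 0 .
```

Likewise, if v satisfies (5.5) with these boundary conditions, then so does v + c for any constant c, which is the additive indeterminacy mentioned above.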

The inverse problem of finding u from values of v on the boundary does not 
have a unique solution, even under idealized conditions [35]. The best one can do is 
construct an “equivalent source” u that explains the data. Such a reconstruction gives 
incomplete, though still useful, information about the actual source. Hence, the role 
of regularization is to provide additional information leading to a distribution u that 
conforms to prior expectations, a rather fundamental difference from sparse signal 
reconstruction. Denoting the discretized Poisson operator of (5.5) by A and the data 
projection operator by Q (see [19] for details), we obtain a problem in the form (5.1) 
with 

$$ J = Q A^{-1} . \tag{5.6} $$

Figure 5.5: Reconstructions of a piecewise constant charge distribution from boundary data. (a) True,
(b) L2G, (c) L1G.

Figure 5.6: Reconstructions of a smoothed-step charge distribution from boundary data. (a) True,
(b) L2G, (c) L1G.


Example 5.3 (Inverse potential problem). In this numerical experiment, the support
of the source u is restricted to the offset inner square (assumed to be known in the
reconstruction) as depicted in Figures 5.5-5.7. The potential is measured on the
boundary, taking the average boundary potential as the ground level, i.e. we subtract
the average boundary potential from each datum. This is necessary as only potential
differences are measurable. Figures 5.5-5.7 depict results for three different source
distributions in the region. In each case, synthetic data are computed on a $64^2$ grid, to
which we add 1% Gaussian noise. The reconstruction is done with our various
regularizations (5.3) and (5.2) on a $32^2$ grid. The regularization constant $\beta$ is tuned to
obtain a resulting misfit of 1 ± 0.1% (see, e.g. [61]).

Figures 5.5 and 5.6 serve as an appetizer. We consider, respectively, piecewise
constant and smoothed-step dipole distributions. Observe that the L1G reconstruc- 
tion results in a well-defined interface between the positively and negatively charged 
regions, whereas the L2G reconstruction is smooth, irrespective of the true model. 
As such, the use of L1G is especially recommended if we know a priori that u is 
piecewise smooth. However, it is not possible to determine from the reconstructions 
whether u has a jump or not: notice the similarity between Figures 5.5 (b) and 5.6 (b), 
and that between Figures 5.5 (c) and 5.6 (c). 

Next, we explore the main theme of this section by considering a point charge 
pair. The true model (ground truth) depicted in Figure 5.7 (a) is now very sparse. 

Figure 5.7: Reconstructions of a point charge pair from boundary data using gradient regularization.
(a) True, (b) L2G, (c) L1G.

Figure 5.8: Reconstructions of a point charge pair from boundary data using regularizations (5.2).
(a) L2, (b) L1, (c) Weighted L1.


The results shown in Figure 5.7 are similar to those in Figures 5.5 and 5.6. The 
dipole structure is apparent from the L2G and L1G reconstructions, though not much 
more is. The L1G reconstruction hints at a dipole pair, but may mislead one to infer 
an incorrect orientation. 

For this last source distribution, a sparse reconstruction seems natural, and one 
such, obtained using an L1 regularization, is depicted in Figure 5.8 (b). The L2 recon- 
struction is depicted in Figure 5.8 (a) for comparison. We see that the L1 reconstruc- 
tion is somewhat sparse, but all the reconstructed sources are on the boundary of the 
support of u(x), and the L1 solution is not as sparse as the true model.

The reason for the observed source distribution is that sources near the detector 
affect the data more and are therefore favored [31]. This effect can be reduced by a lo- 
cation dependent reweighting of the regularization function as suggested in [28, 43], 
which amounts to normalizing the columns of J to unit 2-norm. Letting 


$$ a_j = \Big( \sum_{i=1}^{m} J_{ij}^2 \Big)^{1/2} , \qquad \tilde J_{ij} = J_{ij} / a_j , \qquad \tilde u_j = a_j u_j , $$


we can write $Ju = b$ as $\tilde J \tilde u = b$ and apply the L1 regularization to $\tilde u$. (Note though
that computing $a_j$ for large scale problems may not be practical.) The resulting
reconstruction is depicted in Figure 5.8 (c). The sparsity has improved a little, but we
are still far from the $\ell_0$ solution.
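
The column rescaling just described can be sketched in a few lines. This is a minimal illustration only: the matrix `J` below is a random stand-in, not the inverse potential operator (5.6), and the helper name `normalize_columns` is hypothetical.

```python
import numpy as np

def normalize_columns(J):
    """Scale the columns of J to unit 2-norm.

    Returns (J_tilde, a), where a[j] = ||J[:, j]||_2 and
    J_tilde[:, j] = J[:, j] / a[j].
    """
    a = np.linalg.norm(J, axis=0)
    if np.any(a == 0.0):
        raise ValueError("J has a zero column and cannot be normalized")
    return J / a, a

# Hypothetical stand-in for J: random entries, columns of very different size.
rng = np.random.default_rng(0)
J = rng.standard_normal((5, 8)) * np.logspace(0, -3, 8)
u = rng.standard_normal(8)

J_tilde, a = normalize_columns(J)
u_tilde = a * u  # rescaled unknown, so that J_tilde @ u_tilde equals J @ u
```

Forming the norms requires every column of J explicitly, which is the practical obstacle for large problems noted in the text.
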
For this example, since $\tilde J$ has normalized columns, the famous RIP condition
defined and analyzed in [9] applies. This condition requires that there be a $\delta < \sqrt{2} - 1$
such that for all 4-sparse $\tilde u$, we have


$$ (1 - \delta) \|\tilde u\|_2^2 \le \|\tilde J \tilde u\|_2^2 \le (1 + \delta) \|\tilde u\|_2^2 . \tag{5.7} $$


However, here it can be shown to be violated on physical grounds. Let u be a 4-sparse
source, i.e. nonzero only for indices i in a set T with |T| = 4, and further, let it have
values ±1, so

$$ \|\tilde u\|_2^2 = \sum_{i \in T} a_i^2 \ge 4 \min_i (a_i^2) > 0 . $$

(The value $a_i$ is just the 2-norm of the boundary potential when a unit source is placed
at location i.) Note that $\|\tilde J \tilde u\|_2$ is the $\ell_2$ norm of the boundary potential. By placing
the positive and negative charges very close together, so that they almost cancel each
other, we can make the boundary potential and thereby $\|\tilde J \tilde u\|_2$ arbitrarily small, and
thus $\delta$ becomes arbitrarily close to 1. Hence, the RIP condition is violated. Note that
this does not prove that the sparsest solution cannot be obtained, as the RIP is a
sufficient, though not necessary, condition.
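
The cancellation argument can be illustrated numerically. The sketch below is not the Neumann problem of Example 5.3: as a simplifying assumption it uses the free-space 2D potential of a point source, sampled at hypothetical detector points on a circle, as a stand-in for a column of J. It computes the ratio $\|\tilde J \tilde u\|_2^2 / \|\tilde u\|_2^2 = \|Ju\|_2^2 / \sum_{i \in T} a_i^2$ for two ± charge pairs with separation d. Since (5.7) forces $1 - \delta$ to be no larger than this ratio, a ratio tending to zero pushes $\delta$ towards 1.

```python
import numpy as np

def boundary_potential(src, detectors):
    """Free-space 2D potential -log(r) / (2*pi) of a unit point source at
    `src`, sampled at the detector points (stand-in for one column of J)."""
    r = np.linalg.norm(detectors - src, axis=1)
    return -np.log(r) / (2.0 * np.pi)

def rip_ratio(d, n_det=64):
    """||J u||_2^2 / sum_i a_i^2 for two +/- charge pairs with separation d."""
    theta = 2.0 * np.pi * np.arange(n_det) / n_det
    detectors = 2.0 * np.column_stack([np.cos(theta), np.sin(theta)])
    centers = np.array([[-0.3, 0.0], [0.3, 0.1]])  # hypothetical pair locations
    offset = np.array([d / 2.0, 0.0])
    sources = np.vstack([centers + offset, centers - offset])
    signs = np.array([1.0, 1.0, -1.0, -1.0])       # 4-sparse u with values +/-1
    cols = np.column_stack([boundary_potential(s, detectors) for s in sources])
    a = np.linalg.norm(cols, axis=0)                # column norms a_i
    return np.linalg.norm(cols @ signs) ** 2 / np.sum(a ** 2)

ratios = [rip_ratio(d) for d in (0.2, 0.02, 0.002)]
```

In this surrogate setting the ratio shrinks roughly like d squared as the charges approach each other, so no $\delta < \sqrt{2} - 1$ can satisfy the lower bound in (5.7) for all such sources.
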

The necessary and sufficient y-condition of [40] for obtaining the $\ell_0$ solution
from the $\ell_1$ solution relies on properties of the solution y to the equation


$$ (J^T y)_i = z_i , \tag{5.8} $$


for selected indices i such that $z_i \neq 0$. In our case, to determine if it is possible to
recover a 2-sparse source, the n-vector z should be 2-sparse with entries ±1, so (5.8)
has just two equations. Further, $J^T y = A^{-1} Q^T y$, and we can interpret y as describing
electrical sources on the boundary only, such that the generated potential equals 1
at point $p_1$ and $-1$ at point $p_2$. These correspond to the locations of the point charges
described by z. The y-condition then implies that we can find a y such that the
potential $J^T y$ is between $-1$ and 1 everywhere else. Unfortunately, however, on physical
grounds, we can see that this is not possible. To see this, note that if we place $p_1$ and
$p_2$ very close together, then a very large electrical field will exist between the points,
which must be caused by very large boundary sources, which in turn will generate
close to those sources an even larger (> 1) field. Analytically, we observe that in the
continuum limit, since z is a harmonic function, it must take its extreme values on
the boundary. Since it takes on values ±1 inside, it must take on larger values on the
boundary, and hence the y-condition is violated.


4.2 The effect of ill-conditioning on L1 regularization 


In this subsection, we consider the regularized L1 problem 


$$ \min_u \frac{1}{2} \|Ju - b\|_2^2 + \beta \|Wu\|_1 , \tag{5.9} $$
and show, for a special choice of W, which in a sense favors sparsity, that in the highly
ill-conditioned case and in the presence of noise, the correct sparsity of a ground truth
model can be recovered only if the singular values of J and the sparsity structure
combine in a beneficial manner. This helps explain the negative results of Example 5.3.
Let the singular value decomposition (SVD) of the m × n matrix J be given by


$$ J = U \Sigma V^T , $$

where U and V are orthogonal matrices and $\Sigma = \mathrm{diag}\{\sigma_1, \ldots, \sigma_m\}$ is m × n with the
singular values ordered so that $\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_m$. Further, consider a true model
$u^*$ such that $z^* = V^T u^*$ satisfies

$$ z_i^* = \begin{cases} 1, & i \in T, \\ 0, & i \notin T. \end{cases} \tag{5.10} $$

The emphasis in (5.10) is on the nature of T, i.e. the sparsity: setting the nonzero
values to 1 is just for convenience. For notational simplicity, let us also assume,
without loss of generality, that U = I, the identity. Then, it also makes sense to consider
the case where $z_i^* = 0$ for i > m. Suppose further that the data b is contaminated by
Gaussian noise $\epsilon$ with mean 0 and covariance $\rho^2 I$. We have


$$ b = \Sigma z^* + \epsilon . $$


Thus, for $i \in T$, $z_i^* = (b_i - \epsilon_i) / \sigma_i = 1$.

Turning to approximate solutions and setting $z = V^T u$, recall first the truncated
SVD method, even though it has nothing to do with L1 methods. Thus, we set $\beta = 0$
in (5.9), obtaining the least squares problem


$$ \min_z \frac{1}{2} \|\Sigma z - b\|_2^2 , \tag{5.11} $$


and then, since the noise $\epsilon_i$ is obviously magnified by $\sigma_i^{-1}$, we set


$$ z_i = \begin{cases} b_i / \sigma_i , & i \le r, \\ 0, & i > r, \end{cases} \tag{5.12} $$


where the effective rank r, $r \le m$, is such that the error term depending on $\sigma_i^{-1}$
has tolerable size. Using this regularization method, it is obvious that a necessary
and sufficient condition for obtaining the same sparsity for z and $z^*$ is that T =
{1,2,...,r}. Indeed, no (very) small singular value index can be tolerated in the
set T of the given true model. In particular, we cannot stably obtain the sparse
approximate solution for just any true model. This requirement becomes rather
restrictive in the highly ill-conditioned case, where $r \ll m$.
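
The dependence of the truncated SVD result on the location of T can be seen in a small numerical sketch of (5.12) in the diagonalized setting U = I. The singular values, noise level and index sets below are hypothetical choices made only for illustration, and indices in the code are 0-based.

```python
import numpy as np

def tsvd_diagonal(sigma, b, r):
    """Truncated SVD solution (5.12) with U = I: z_i = b_i / sigma_i for
    the first r components and z_i = 0 for the remaining ones."""
    z = np.zeros(b.shape, dtype=float)
    z[:r] = b[:r] / sigma[:r]
    return z

def support(z):
    """Set of (0-based) indices where z is nonzero."""
    return set(np.flatnonzero(z).tolist())

sigma = np.logspace(0, -8, 9)          # hypothetical, rapidly decaying
rho = 1e-3                             # hypothetical noise level
rng = np.random.default_rng(1)
eps = rho * rng.standard_normal(sigma.size)

def data_for(T):
    """Noisy data b = Sigma z* + eps for a true model with support T."""
    z_star = np.zeros(sigma.size)
    z_star[list(T)] = 1.0
    return sigma * z_star + eps

r = 2
z_good = tsvd_diagonal(sigma, data_for({0, 1}), r)  # T = leading indices
z_bad = tsvd_diagonal(sigma, data_for({0, 7}), r)   # T contains a small sigma
```

With T equal to the leading r indices the support is reproduced, whereas for the second choice the truncated solution has support {0, 1} instead of {0, 7}.
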

Of course, the truncated SVD method not only does not have L1 magic, it also 
requires carrying out the SVD, something we wish to avoid for the large problems 
considered in this section. Let us now return to the Tikhonov-type method (5.9) with
$\beta > 0$, and consider the special case of the L1 approach with $W = V^T$. This special
case is in a sense the most favorable for the sparsity-inducing algorithm to work well.
This is so because the subspace defined by $\Sigma z = b$ has the best possible orientation,
with respect to the faces of the polyhedron $\|z\|_1 = \text{constant}$, to cause intersection
at a face corresponding to the correct sparsity. See, for example, Figure 1 in [10] for
a sparsity spoiling orientation that cannot occur in our case. So, if we encounter
difficulties caused by ill-conditioning in this special case, then they will persist upon
using a more general W.
Thus, we are considering the problem


$$ \min_z \frac{1}{2} \|\Sigma z - b\|_2^2 + \beta \|z\|_1 . \tag{5.13} $$


Because (5.13) is just a sum of decoupled terms, we can solve it explicitly for each
component of z. The solution has $z_i = 0$ where the gradient of the data fitting term is
bounded by the gradient of the regularization term, which gives

$$ \beta \ge |\sigma_i (\sigma_i z_i^* + \epsilon_i)| . $$

Otherwise,

$$ z_i = \big( (\sigma_i z_i^* + \epsilon_i) \pm \beta / \sigma_i \big) / \sigma_i , $$

where the sign in front of $\beta$ is not needed for our purposes.
In order for z to have the same sparsity as $z^*$, we therefore must have

$$ \beta < |\sigma_i (\sigma_i + \epsilon_i)| \quad \text{for } i \in T, \qquad \beta \ge |\sigma_i \epsilon_i| \quad \text{for } i \notin T . $$
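
The componentwise solution of (5.13) is the familiar soft-thresholding formula, sketched below for positive singular values. The function name and the numbers in the example are hypothetical; the sign that the text leaves open is taken here to be the sign of $b_i$, which is what the optimality condition gives.

```python
import numpy as np

def l1_diagonal_solution(sigma, b, beta):
    """Minimizer of 0.5 * sum((sigma_i z_i - b_i)^2) + beta * sum(|z_i|)
    for sigma_i > 0. Component i is zero when |sigma_i * b_i| <= beta;
    otherwise z_i = (b_i - sign(b_i) * beta / sigma_i) / sigma_i."""
    sigma = np.asarray(sigma, dtype=float)
    b = np.asarray(b, dtype=float)
    shrunk = np.abs(b) - beta / sigma
    return np.sign(b) * np.maximum(shrunk, 0.0) / sigma

def support(z):
    """Set of (0-based) indices where z is nonzero."""
    return set(np.flatnonzero(z).tolist())

# Hypothetical data: true support T = {0, 2}, with a small sigma inside T.
sigma = np.array([1.0, 0.5, 0.1])
z_star = np.array([1.0, 0.0, 1.0])
eps = np.array([0.01, -0.02, 0.005])
b = sigma * z_star + eps

z_large_beta = l1_diagonal_solution(sigma, b, beta=0.02)    # loses index 2
z_small_beta = l1_diagonal_solution(sigma, b, beta=0.005)   # spurious index 1
z_tuned_beta = l1_diagonal_solution(sigma, b, beta=0.0102)  # recovers T
```

For these numbers the two conditions above require $0.01 \le \beta < 0.0105$, so only a narrow window of $\beta$ values reproduces the true support.
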
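Squaring these inequalities and replacing $\epsilon_i^2$ by its expected value $\rho^2$ (the cross
term $2 \sigma_i^3 \epsilon_i$ has zero mean) gives the condition

$$ \max_{i \notin T} \rho^2 \sigma_i^2 \le \beta^2 < \min_{i \in T} \sigma_i^2 (\sigma_i^2 + \rho^2) . $$

Thus, the regularization parameter $\beta$ must satisfy

$$ \rho \sigma_+ \le \beta < \sigma_- \sqrt{\sigma_-^2 + \rho^2} , \tag{5.14a} $$

with

$$ \sigma_+ = \max_{i \notin T} \sigma_i , \qquad \sigma_- = \min_{i \in T} \sigma_i . \tag{5.14b} $$

From (5.14), it follows that the correct sparsity pattern can be comfortably recovered
if $\sigma_+ \le \sigma_-$, i.e. if all small singular values are not in T and all others are in T, just
as for the truncated SVD method.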


The case where L1 may offer potential advantage over truncated SVD is when
$\sigma_+ > \sigma_-$. In this case, (5.14a) yields the requirement


$$ \rho < \frac{\sigma_-^2}{\sqrt{\sigma_+^2 - \sigma_-^2}} . \tag{5.15} $$
We summarize this as follows:


Theorem 5.4. Consider the L1 regularization problem (5.9). For the specific case
defined above using (5.10), (5.13) and (5.14b), the true and reconstructed models, $z^*$ and z,
are expected to have the same zero structure only if either $\sigma_+ \le \sigma_-$ or (5.15) holds.


Unfortunately, if $\sigma_- \ll 1$ and/or $\sigma_+ \gg \sigma_-$, then the condition (5.15) may be too
restrictive in practice, possibly holding only for an unrealistically small noise level.
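
To get a feeling for how restrictive this is, the admissible interval for $\beta$ in (5.14a) and the noise bound obtained by rearranging $\rho \sigma_+ < \sigma_- \sqrt{\sigma_-^2 + \rho^2}$ can be evaluated directly. The sketch below uses hypothetical helper names and hypothetical singular values, with 0-based indices for T.

```python
import math

def beta_interval(sigma, T, rho):
    """Return (lower, upper) of (5.14a):
    rho * sigma_plus <= beta < sigma_minus * sqrt(sigma_minus^2 + rho^2)."""
    T = set(T)
    sigma_plus = max(s for i, s in enumerate(sigma) if i not in T)
    sigma_minus = min(s for i, s in enumerate(sigma) if i in T)
    lower = rho * sigma_plus
    upper = sigma_minus * math.sqrt(sigma_minus ** 2 + rho ** 2)
    return lower, upper

def max_noise_level(sigma_plus, sigma_minus):
    """Noise bound sigma_minus^2 / sqrt(sigma_plus^2 - sigma_minus^2),
    valid for sigma_plus > sigma_minus."""
    return sigma_minus ** 2 / math.sqrt(sigma_plus ** 2 - sigma_minus ** 2)

sigma = [1.0, 0.5, 0.1]   # hypothetical singular values
T = {0, 2}                # hypothetical support, so sigma_plus = 0.5 > sigma_minus = 0.1

lo_ok, hi_ok = beta_interval(sigma, T, rho=0.01)    # nonempty interval
lo_bad, hi_bad = beta_interval(sigma, T, rho=0.03)  # empty interval
rho_max = max_noise_level(0.5, 0.1)
```

Here the interval is nonempty for $\rho = 0.01$ but empty for $\rho = 0.03$, with the threshold near 0.02; a smaller $\sigma_-$ lowers this threshold quadratically.
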

Further difficulties arise upon considering the usual practical process of selecting
the regularization parameter $\beta$ by the discrepancy principle (see, e.g. [61]), i.e. such
that the total misfit $\mu$ satisfies


$$ \mu^2 = \frac{1}{m} \sum_{i=1}^{m} \big( \sigma_i (z_i - z_i^*) - \epsilon_i \big)^2 = \rho^2 . $$

Let us next compute the misfit for $\beta$ satisfying (5.14a), assuming $\rho$ is such that this is
possible, i.e. one of the conditions of Theorem 5.4 holds, and show that the misfit can
easily be much too large in the ill-conditioned case. Conversely, this means that if $\beta$
was selected by the discrepancy principle, condition (5.14a) would be violated.

Let us choose $\beta = \rho \sigma_+$, i.e. the smallest possible $\beta$ satisfying (5.14a). Replacing
$\epsilon_i^2$ by its expected value, the expected misfit squared becomes


$$ \frac{1}{m} \Big( \sum_{i \notin T,\, i \le m} \rho^2 + \sum_{i \in T} \rho^2 \sigma_+^2 / \sigma_i^2 \Big) . $$

The discrepancy principle requirement $\mu = \rho$ can now be written as


$$ \frac{1}{|T|} \sum_{i \in T} \sigma_+^2 / \sigma_i^2 = 1 . $$

However, if J is ill-conditioned, the mean value of $\sigma_+^2 / \sigma_i^2$ over the set T could be
very large, implying that $\beta$ (chosen to recover the correct sparsity) is too large to
satisfy the discrepancy principle. Conversely, the value of $\beta$ selected by the discrepancy
principle will be too small to recover the correct sparsity of $z^*$.
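
A quick numerical check of this argument evaluates the expected squared misfit for $\beta = \rho \sigma_+$, divided by $\rho^2$. The sketch assumes a square problem (m = n) and uses hypothetical singular values and index sets (0-based); a value far above 1 means that this $\beta$ overshoots the discrepancy principle.

```python
import numpy as np

def expected_misfit_over_noise(sigma, T):
    """Expected squared misfit divided by rho^2 when beta = rho * sigma_plus:
    (1/m) * (#{i not in T} + sum_{i in T} sigma_plus^2 / sigma_i^2)."""
    sigma = np.asarray(sigma, dtype=float)
    in_T = np.zeros(sigma.size, dtype=bool)
    in_T[list(T)] = True
    sigma_plus = sigma[~in_T].max()
    m = sigma.size
    total = np.count_nonzero(~in_T) + np.sum(sigma_plus ** 2 / sigma[in_T] ** 2)
    return total / m

sigma = np.logspace(0, -6, 7)  # hypothetical: 1, 1e-1, ..., 1e-6

ratio_leading = expected_misfit_over_noise(sigma, {0, 1})  # T = leading indices
ratio_mixed = expected_misfit_over_noise(sigma, {0, 5})    # T contains sigma = 1e-5
```

For the leading index set the ratio stays below 1, while a single small singular value inside T drives it to roughly $10^7$ in this example.
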

It is important to emphasize that we do not claim that L1 variants cannot work
for highly ill-conditioned problems. Rather, they may not necessarily work. It all
depends on how the sparsity pattern T of the true solution and the singular values of J
relate. Moreover, we do not know of a method that does better than L1 in the present
sense. Then again, our expectations regarding sparsity are lower for most other methods
in the first place.

4.3 Nonlinear, highly ill-posed examples


In this subsection, we study the DC resistivity problem on the unit square. The forward
problem for v, given by


$$ -\nabla \cdot \big( \sigma(u)(x) \nabla v \big) = q(x), \quad x \in \Omega, \tag{5.16} $$


subject to Neumann boundary conditions, predicts the potential v for given external
source q and conductivity $\sigma$ (parameterized in terms of u). The inverse problem is to
recover the conductivity $\sigma(u)$ from partial measurements of the potential $v^i$, when
different current patterns $q^i$, i = 1,...,s, are injected into the region.

For experiment i, $q^i$ consists of a positive point source on the left boundary and
an opposite point source on the right boundary, and thus


$$ q^i(x) = \delta_{x, p_L^i} - \delta_{x, p_R^i} , $$


where $p_L^i$ and $p_R^i$ are located on the left and right boundaries. Different data sets are
obtained by varying the positions $p_L^i$ and $p_R^i$ of the two opposing point sources. We
place each at $\sqrt{s}$ equidistant points including the corners, in all possible combinations,
which gives a total of s data sets when s is a perfect square. Voltage is measured on
the boundary, so the number of points in each data set is the number of boundary
points of the discretization mesh. See [17, 52] and references therein for details of the
problem setup such as the discretization of (5.16) and the solution of the resulting
optimization problem.
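
The enumeration of source configurations can be sketched as follows. This is an illustration only, with a hypothetical helper name; it assumes the unit square $[0, 1]^2$ with the left boundary at x = 0 and the right boundary at x = 1, and it says nothing about how the sources are mapped onto the discretization mesh in [17, 52].

```python
import itertools
import math

def source_pairs(s):
    """All (p_L, p_R) combinations for sqrt(s) equidistant positions on the
    left and right sides of the unit square, corners included."""
    k = math.isqrt(s)
    if k * k != s or k < 2:
        raise ValueError("s must be a perfect square with sqrt(s) >= 2")
    ys = [j / (k - 1) for j in range(k)]  # equidistant, includes both corners
    left = [(0.0, y) for y in ys]         # candidate positions p_L
    right = [(1.0, y) for y in ys]        # candidate positions p_R
    return list(itertools.product(left, right))

pairs = source_pairs(4)
# [((0.0, 0.0), (1.0, 0.0)), ((0.0, 0.0), (1.0, 1.0)),
#  ((0.0, 1.0), (1.0, 0.0)), ((0.0, 1.0), (1.0, 1.0))]
```

Calling `source_pairs(64)` or `source_pairs(1024)` yields the 64 and 1024 configurations corresponding to the values of s used in the experiments below.
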

For this nonlinear inverse problem, it is well-known that, unlike for the inverse
potential problem, increasing the number of data sets s allows a more accurate
recovery of the resistivity $1/\sigma$. There is no reason to apply L1 here, and the purpose of
the following experiments is to determine, for a piecewise continuous surface
recovery, roughly at what point of such computational refinement the L1G regularization
becomes worthwhile.


Example 5.5 (EIT and DC-resistivity). We have chosen to recover a grid approximation
u of


$$ u(x) = \psi^{-1}(\sigma(x)) , \tag{5.17a} $$

where the transfer function

$$ \psi(t) = \frac{1}{2} (\sigma_{\max} - \sigma_{\min}) \tanh(t) + \frac{1}{2} (\sigma_{\max} + \sigma_{\min}) \tag{5.17b} $$


enforces a priori known upper and lower bounds on the possible conductivity. 
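
A minimal sketch of the transfer function and its inverse is given below. The bounds 1 and 10 are the values used in this example; the function names `psi` and `psi_inv` are chosen here for illustration.

```python
import numpy as np

SIGMA_MIN, SIGMA_MAX = 1.0, 10.0  # bounds used in Example 5.5

def psi(t, smin=SIGMA_MIN, smax=SIGMA_MAX):
    """Transfer function (5.17b): maps any real t into [smin, smax]."""
    return 0.5 * (smax - smin) * np.tanh(t) + 0.5 * (smax + smin)

def psi_inv(sigma, smin=SIGMA_MIN, smax=SIGMA_MAX):
    """Inverse of psi, defined for smin < sigma < smax."""
    return np.arctanh((2.0 * sigma - (smax + smin)) / (smax - smin))

ts = np.linspace(-2.0, 2.0, 9)
conductivities = psi(ts)              # always between the two bounds
recovered = psi_inv(conductivities)   # round trip back to ts
```

Because the bounds are built into the parameterization, u can be optimized without constraints while $\sigma(u) = \psi(u)$ automatically respects them.
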

A synthetic conductivity model is used to compute the data b, which is calculated 
on a grid that is twice as fine as the grid used for the reconstruction, and either 3 % or 
1% Gaussian noise is added to it. 

The ground truth model used to synthesize data consists of an object with
conductivity $\sigma = 1$ (black) placed in a background of conductivity $\sigma = 10$ (white);
see Figure 5.9 (a). In (5.17b), we set $\sigma_{\min} = 1$ and $\sigma_{\max} = 10$. The inverse
problem involves minimizing expressions of the form (5.1), (5.3). We compare p = 1 (total
variation, or L1G) with p = 2 (L2G). A $128^2$ uniform grid is used in these calculations.

Figure 5.9: Conductivity reconstructions for different numbers s of data sets with noise level 3%.
(a) True model, (b) s = 4, L2G, (c) s = 4, L1G, (d) s = 64, L2G, (e) s = 64, L1G.

Figure 5.9 shows the obtained reconstructions using s = 4 and s = 64 current
configurations at a noise level of 3%. The regularization parameter $\beta$ was tuned to
result in a misfit of 3 ± 0.1%. Observe that in the case of scarce data, s = 4, there appears
to be no advantage to using the L1G regularization, whereas with 64 data sets the L1G
reconstruction is only marginally better than L2G.

Next, we use s = 1024 data sets at a noise level of 1%, with $\beta$ correspondingly
tuned. In order to accommodate so many right-hand sides, we employ the stochastic 
adaptive algorithm described in [17]. The results are depicted in Figure 5.10. At this in- 
creased model accuracy and resolution, the result obtained using L1G is clearly better 
than that obtained using L2G. 


The situation described in Example 5.5 is not uncommon in practice. Often in
geophysical experiments, results of the sort depicted in Figure 5.9 (d,e) are of sufficient
quality, and the lower noise level and larger number of experiments s required for
obtaining the result in Figure 5.10 (b) are a sort of luxury that is not always attained.
Moreover, the forward problem considered in this section is often indicative of what
is observed numerically, also for more complex problems such as low frequency
electromagnetic and seismic data inversions. Finally, weighted L2G variants that are
routinely used in geophysical applications may further improve reconstructions without
resorting to $\ell_1$-based regularization. In view of the occasionally significantly higher
cost of computing with L1G, it cannot be automatically concluded that the latter is
worthwhile for this application, although it is a viable option that we always entertain
in the course of our research.

Figure 5.10: Reconstructions for a larger number of data sets s = 1024 and with the noise level at
only 1%. Here, L1G clearly outshines L2G. (a) s = 1024, L2G, (b) s = 1024, L1G.


5 Summary 


In this paper, we have investigated the relative performance of $\ell_1$-based
regularization techniques on several examples and case studies. We have shown cases where
such methods are worse than $\ell_2$-based ones in the sense of costing more without
delivering more (Examples 5.1 and 5.5), and other cases where such methods produce
better results (see Figures 5.4b and 5.10). Further, we have shown cases where the
$\ell_1$-based results appear to be more misleading than corresponding $\ell_2$-based results
(Example 5.3).

In Section 4.2, we have analyzed the effect of ill-conditioning on the ability of 
an L1 method to correctly recover solution sparsity. Theorem 5.4 and the arguments 
following it suggest severe limitations in case of extreme ill-conditioning that perhaps 
arises in certain inverse problems. 

The results in Section 4.3 demonstrate how and when L1G becomes favored as 
the quality of the data improves. This in itself is intuitively expected, but less clear 
is where the crossover point occurs in realistic situations. Unfortunately, we had to 
tweak the problem beyond what may be expected in many geophysical situations in 
order to observe the L1G takeover. 

Let us again stress our overall conviction that the swing of the pendulum in recent
years towards $\ell_1$-based techniques is rather important and not merely refreshing. Our
purpose here, far from opposing this trend, is to simply suggest that this virtual
pendulum should not swing too far and away, to realms beyond reason. To this end, we
note the following.


In many situations, $\ell_1$-based regularization is well worth using. Such techniques
can provide exciting advances (e.g. in model reduction, computer graphics,
image processing and reconstruction of surfaces with discontinuities).

However, such techniques are not good for all problems, and it is dangerous (and
may consume many student-years) to apply them blindly.

In practice, we recommend always considering $\ell_2$-based regularization
techniques first, because they are simpler, easier to compute with, and do not
introduce nonlinearities or lower smoothness. Only upon deciding that these are
not sufficiently good for the given application is it highly advisable to proceed to
examine $\ell_1$-based alternatives (when this makes sense).

Last but not least, the possibility of combining $\ell_1$- and $\ell_2$-based techniques
suggests itself. We have already commented on using the Huber switching function
as well as IRLS techniques [1, 30, 33, 56, 61] for this purpose in the L1G-L2G
context, but these ideas are also very popular in the image processing and computer
vision literature in mixing the L1 and L2 approaches [34]. Another popular
approach is to employ an empirical Bayesian framework in order to learn an
appropriate mix [37, 57].